[mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path #66220


Merged
merged 4 commits into llvm:main from ptx-jit on Sep 14, 2023

Conversation

fabianmcg (Contributor)

This patch adds an NVPTX compilation path that enables JIT compilation on NVIDIA targets. The following modifications were performed (a loader-dispatch sketch follows the list):

  1. Adding a `format` field to the GPU object attribute, allowing the translation attribute to use the correct runtime function to load the module. Likewise, a dictionary attribute was added for passing any extra options.

  2. Adding the `createObject` method to `GPUTargetAttrInterface`; this method returns a GPU object from a binary string.

  3. Adding the function `mgpuModuleLoadJIT`, which is only available for NVIDIA GPUs, as there is no equivalent for AMD.

  4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify the format to use during testing.
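
To make items 1-3 concrete, here is a minimal C++ sketch of dispatching between the two runtime loaders based on the object format. The wrapper signatures are taken from `CudaRuntimeWrappers.cpp` in this patch; the `ObjectFormat` enum mirrors the new `CompilationTarget` cases, while the `loadModule` helper and its default `optLevel` are illustrative only, not part of the patch:

```cpp
#include <cstdint>

#include <cuda.h>

// Runtime wrapper declarations matching CudaRuntimeWrappers.cpp in this patch.
extern "C" CUmodule mgpuModuleLoad(void *data);
extern "C" CUmodule mgpuModuleLoadJIT(void *data, int optLevel);

// Mirrors the CompilationTarget cases added in CompilationAttrs.td.
enum class ObjectFormat : std::uint32_t {
  Offload = 1, Assembly = 2, Binary = 3, Fatbin = 4
};

// Hypothetical helper: PTX (assembly) has to be JITted by the driver at load
// time; CUBINs and fatbins can be loaded directly. optLevel=2 is arbitrary.
CUmodule loadModule(ObjectFormat format, void *object, int optLevel = 2) {
  if (format == ObjectFormat::Assembly)
    return mgpuModuleLoadJIT(object, optLevel);
  return mgpuModuleLoad(object);
}
```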

NOTE:

  1. Not all tests are using `MLIR_GPU_COMPILATION_TEST_FORMAT`.
  2. An option needs to be added to the SparseCompiler to support the format option; however, I don't know if there's any preference.
  3. I'm basing the implementation of `mgpuModuleLoadJIT` on the assumption that there's a JIT cache (see the sketch below). Another option is to implement the cache itself in MLIR.
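
For context on note 3: the assumption is that the CUDA driver keeps its own JIT cache, so repeatedly loading the same PTX does not pay the full compilation cost on every run. A minimal sketch of steering that cache from the host process via the driver's documented `CUDA_CACHE_*` environment variables; the size and path below are arbitrary:

```cpp
#include <cstdlib>

int main() {
  // Both variables are read by the CUDA driver itself, not by MLIR; the
  // size (1 GiB) and path below are arbitrary illustrations.
  setenv("CUDA_CACHE_MAXSIZE", "1073741824", /*overwrite=*/1);
  setenv("CUDA_CACHE_PATH", "/tmp/nv-jit-cache", /*overwrite=*/1);
  // ... initialize CUDA and load PTX modules (e.g. via mgpuModuleLoadJIT) ...
  return 0;
}
```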

llvmbot (Member) commented Sep 13, 2023

@llvm/pr-subscribers-mlir-sparse
@llvm/pr-subscribers-mlir-llvm

@llvm/pr-subscribers-mlir

Changes: identical to the pull request description above.
Patch is 50.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66220.diff

33 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td (+12-3)
  • (modified) mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td (+23-2)
  • (modified) mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h (+16-22)
  • (modified) mlir/include/mlir/Dialect/GPU/Transforms/Passes.td (+1-2)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+44-5)
  • (modified) mlir/lib/Dialect/GPU/Transforms/ModuleToBinary.cpp (+18-18)
  • (modified) mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp (+21)
  • (modified) mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp (+5)
  • (modified) mlir/lib/Target/LLVM/NVVM/Target.cpp (+31-7)
  • (modified) mlir/lib/Target/LLVM/ROCDL/Target.cpp (+19-2)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp (+69-21)
  • (modified) mlir/test/CMakeLists.txt (+2)
  • (modified) mlir/test/Dialect/GPU/module-to-binary-nvvm.mlir (+3-3)
  • (modified) mlir/test/Dialect/GPU/module-to-binary-rocdl.mlir (+3-3)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+10)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-max.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-min.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-op.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-or.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-region.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/all-reduce-xor.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/async.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/gpu-to-cubin.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/lit.local.cfg (+2)
  • (modified) mlir/test/Integration/GPU/CUDA/multiple-all-reduce.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/printf.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/shuffle.mlir (+1-1)
  • (modified) mlir/test/Integration/GPU/CUDA/two-modules.mlir (+1-1)
  • (modified) mlir/test/lib/Dialect/GPU/TestLowerToNVVM.cpp (+7-1)
  • (modified) mlir/test/lit.site.cfg.py.in (+1)
  • (modified) mlir/unittests/Target/LLVM/SerializeNVVMTarget.cpp (+3-3)
  • (modified) mlir/unittests/Target/LLVM/SerializeROCDLTarget.cpp (+3-3)

diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td b/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td
index 5255286619e3bf2..160730480394272 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td
@@ -33,12 +33,21 @@ def GPUTargetAttrInterface : AttrInterface<"TargetAttrInterface"> {
 
       If serialization fails then the method should return `std::nullopt`.
 
-      The `module` argument must be a GPU Module Op. The `options` argument is
-      meant to be used for passing additional options that are not in the
+      The `module` parameter must be a GPU Module Op. The `options` parameter
+      is meant to be used for passing additional options that are not in the
       attribute.
     }],
     "std::optional<SmallVector<char, 0>>", "serializeToObject",
-    (ins "Operation*":$module, "const gpu::TargetOptions&":$options)>
+    (ins "Operation*":$module, "const gpu::TargetOptions&":$options)>,
+    InterfaceMethod<[{
+      Creates a GPU object attribute from a binary string.
+
+      The `object` parameter is a binary string. The `options` parameter is
+      meant to be used for passing additional options that are not in the
+      attribute.
+    }], "Attribute", "createObject",
+      (ins "const SmallVector<char, 0>&":$object,
+           "const gpu::TargetOptions&":$options)>
   ];
 }

diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td b/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td
index 9c1110d8e9a9463..3d2e9848a2b25a0 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td
@@ -20,6 +20,18 @@ include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
 // GPU object attribute.
 //===----------------------------------------------------------------------===//
 
+def GPU_ObjectOffload : I32EnumAttrCase<"Offload", 1, "offload">;
+def GPU_ObjectISA : I32EnumAttrCase<"Assembly", 2, "assembly">;
+def GPU_ObjectBinary : I32EnumAttrCase<"Binary", 3, "bin">;
+def GPU_ObjectFatbin : I32EnumAttrCase<"Fatbin", 4, "fatbin">;
+def GPU_CompilationTargetEnum : GPU_I32Enum<
+  "CompilationTarget", "GPU object format", [
+      GPU_ObjectOffload,
+      GPU_ObjectISA,
+      GPU_ObjectBinary,
+      GPU_ObjectFatbin
+    ]>;
+
 def GPU_ObjectAttr : GPU_Attr<"Object", "object"> {
   let description = [{
     A GPU object attribute pairs a GPU target with a binary string,
@@ -32,8 +44,17 @@ def GPU_ObjectAttr : GPU_Attr<"Object", "object"> {
     #gpu.object<#nvvm.target, "...">
     ```
   }];
-  let parameters = (ins "Attribute":$target, "StringAttr":$object);
-  let assemblyFormat = [{`<` $target `,` $object `>`}];
+  let parameters = (ins
+    "Attribute":$target,
+    DefaultValuedParameter<"CompilationTarget", "CompilationTarget::Fatbin">:$format,
+    "StringAttr":$object,
+    OptionalParameter<"DictionaryAttr">:$properties
+  );
+  let assemblyFormat = [{`<`
+      $target `,` (`properties` `=` $properties ^ `,`)?
+      custom<Object>($format, $object)
+    `>`
+  }];
   let genVerifyDecl = 1;
 }

diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h b/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
index a1f64be57fa699d..ee7daed58f98314 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
+++ b/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
@@ -25,6 +25,8 @@ namespace LLVM {
 class ModuleTranslation;
 }
 namespace gpu {
+enum class CompilationTarget : uint32_t;
+
 /// This class indicates that the attribute associated with this trait is a GPU
 /// offloading translation attribute. These kinds of attributes must implement
 /// an interface for handling the translation of GPU offloading operations like
@@ -42,27 +44,15 @@ class OffloadingTranslationAttrTrait
 /// ensure type safeness. Targets are free to ignore these options.
 class TargetOptions {
 public:
-  /// The target representation of the compilation process.
-  typedef enum {
-    offload = 1,   /// The process should produce an offloading representation.
-                   /// For the NVVM & ROCDL targets this option produces LLVM IR.
-    assembly = 2,  /// The process should produce assembly code.
-    binary = 4,    /// The process should produce a binary.
-    fatbinary = 8, /// The process should produce a fat binary.
-    binOrFatbin =
-        binary |
-        fatbinary, /// The process should produce a binary or fatbinary. It's up
-                   /// to the target to decide which.
-  } CompilationTarget;
-
   /// Constructor initializing the toolkit path, the list of files to link to,
   /// extra command line options, the compilation target and a callback for
   /// obtaining the parent symbol table. The default compilation target is
   /// binOrFatbin.
-  TargetOptions(StringRef toolkitPath = {},
-                ArrayRef<std::string> linkFiles = {}, StringRef cmdOptions = {},
-                CompilationTarget compilationTarget = binOrFatbin,
-                function_ref<SymbolTable *()> getSymbolTableCallback = {});
+  TargetOptions(
+      StringRef toolkitPath = {}, ArrayRef<std::string> linkFiles = {},
+      StringRef cmdOptions = {},
+      CompilationTarget compilationTarget = getDefaultCompilationTarget(),
+      function_ref<SymbolTable *()> getSymbolTableCallback = {});
 
   /// Returns the typeID.
   TypeID getTypeID() const;
@@ -90,13 +80,17 @@ class TargetOptions {
   /// table.
   SymbolTable *getSymbolTable() const;
 
+  /// Returns the default compilation target: `CompilationTarget::Fatbin`.
+  static CompilationTarget getDefaultCompilationTarget();
+
 protected:
   /// Derived classes must use this constructor to initialize typeID to the
   /// appropiate value: ie. TargetOptions(TypeID::get<DerivedClass>()).
-  TargetOptions(TypeID typeID, StringRef toolkitPath = {},
-                ArrayRef<std::string> linkFiles = {}, StringRef cmdOptions = {},
-                CompilationTarget compilationTarget = binOrFatbin,
-                function_ref<SymbolTable *()> getSymbolTableCallback = {});
+  TargetOptions(
+      TypeID typeID, StringRef toolkitPath = {},
+      ArrayRef<std::string> linkFiles = {}, StringRef cmdOptions = {},
+      CompilationTarget compilationTarget = getDefaultCompilationTarget(),
+      function_ref<SymbolTable *()> getSymbolTableCallback = {});
 
   /// Path to the target toolkit.
   std::string toolkitPath;
@@ -108,7 +102,7 @@ class TargetOptions {
   /// process.
   std::string cmdOptions;
 
-  /// Compilation process target representation.
+  /// Compilation process target format.
   CompilationTarget compilationTarget;
 
   /// Callback for obtaining the parent symbol table of all the GPU modules
diff --git a/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td b/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
index 0bfb2750992058f..3de8e18851369df 100644
--- a/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
+++ b/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
@@ -68,7 +68,6 @@ def GpuModuleToBinaryPass
     2. assembly, isa: produces assembly code.
     3. binary, bin: produces binaries.
     4. fatbinary, fatbin: produces fatbinaries.
-    5. binOrFatbin: produces bins or fatbins, the target decides which.
   }];
   let options = [
     Option<"offloadingHandler", "handler", "Attribute", "nullptr",
@@ -79,7 +78,7 @@ def GpuModuleToBinaryPass
            "Extra files to link to.">,
     Option<"cmdOptions", "opts", "std::string", [{""}],
            "Command line options to pass to the tools.">,
-    Option<"compilationTarget", "format", "std::string", [{"binOrFatbin"}],
+    Option<"compilationTarget", "format", "std::string", [{"fatbin"}],
            "The target representation of the compilation process.">
   ];
 }
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index fde379cd0afe13f..5eb2cadc884e151 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1959,7 +1959,8 @@ void AllocOp::getCanonicalizationPatterns(RewritePatternSet &results,
 //===----------------------------------------------------------------------===//
 
 LogicalResult ObjectAttr::verify(function_ref<InFlightDiagnostic()> emitError,
-                                 Attribute target, StringAttr object) {
+                                 Attribute target, CompilationTarget format,
+                                 StringAttr object, DictionaryAttr properties) {
   if (!target)
     return emitError() << "the target attribute cannot be null";
   if (target.hasPromiseOrImplementsInterface<TargetAttrInterface>())
@@ -1968,6 +1969,40 @@ LogicalResult ObjectAttr::verify(function_ref<InFlightDiagnostic()> emitError,
                         "gpu::TargetAttrInterface";
 }
 
+namespace {
+LogicalResult parseObject(AsmParser &odsParser, CompilationTarget &format,
+                          StringAttr &object) {
+  std::optional<CompilationTarget> formatResult;
+  StringRef enumKeyword;
+  auto loc = odsParser.getCurrentLocation();
+  if (failed(odsParser.parseOptionalKeyword(&enumKeyword)))
+    formatResult = CompilationTarget::Fatbin;
+  if (!formatResult &&
+      (formatResult =
+           gpu::symbolizeEnum<gpu::CompilationTarget>(enumKeyword)) &&
+      odsParser.parseEqual())
+    return odsParser.emitError(loc, "expected an equal sign");
+  if (!formatResult)
+    return odsParser.emitError(loc, "expected keyword for GPU object format");
+  FailureOr<StringAttr> objectResult =
+      FieldParser<StringAttr>::parse(odsParser);
+  if (failed(objectResult))
+    return odsParser.emitError(odsParser.getCurrentLocation(),
+                               "failed to parse GPU_ObjectAttr parameter "
+                               "'object' which is to be a `StringAttr`");
+  format = *formatResult;
+  object = *objectResult;
+  return success();
+}
+
+void printObject(AsmPrinter &odsParser, CompilationTarget format,
+                 StringAttr object) {
+  if (format != CompilationTarget::Fatbin)
+    odsParser << stringifyEnum(format) << " = ";
+  odsParser << object;
+}
+} // namespace
+
 //===----------------------------------------------------------------------===//
 // GPU select object attribute
 //===----------------------------------------------------------------------===//
@@ -2020,6 +2055,14 @@ SymbolTable *TargetOptions::getSymbolTable() const {
   return getSymbolTableCallback ? getSymbolTableCallback() : nullptr;
 }
 
+CompilationTarget TargetOptions::getCompilationTarget() const {
+  return compilationTarget;
+}
+
+CompilationTarget TargetOptions::getDefaultCompilationTarget() {
+  return CompilationTarget::Fatbin;
+}
+
 std::pair<llvm::BumpPtrAllocator, SmallVector<const char *>>
 TargetOptions::tokenizeCmdOptions() const {
   std::pair<llvm::BumpPtrAllocator, SmallVector<const char *>> options;
@@ -2043,10 +2086,6 @@ TargetOptions::tokenizeCmdOptions() const {
   return options;
 }
 
-TargetOptions::CompilationTarget TargetOptions::getCompilationTarget() const {
-  return compilationTarget;
-}
-
 MLIR_DEFINE_EXPLICIT_TYPE_ID(::mlir::gpu::TargetOptions)
 
 #include "mlir/Dialect/GPU/IR/GPUOpInterfaces.cpp.inc"
diff --git a/mlir/lib/Dialect/GPU/Transforms/ModuleToBinary.cpp b/mlir/lib/Dialect/GPU/Transforms/ModuleToBinary.cpp
index e29a1f0c3248d04..2bf89f8c57903e5 100644
--- a/mlir/lib/Dialect/GPU/Transforms/ModuleToBinary.cpp
+++ b/mlir/lib/Dialect/GPU/Transforms/ModuleToBinary.cpp
@@ -57,14 +57,14 @@ void GpuModuleToBinaryPass::getDependentDialects(
 
 void GpuModuleToBinaryPass::runOnOperation() {
   RewritePatternSet patterns(&getContext());
-  int targetFormat = llvm::StringSwitch<int>(compilationTarget)
-                         .Cases("offloading", "llvm", TargetOptions::offload)
-                         .Cases("assembly", "isa", TargetOptions::assembly)
-                         .Cases("binary", "bin", TargetOptions::binary)
-                         .Cases("fatbinary", "fatbin", TargetOptions::fatbinary)
-                         .Case("binOrFatbin", TargetOptions::binOrFatbin)
-                         .Default(-1);
-  if (targetFormat == -1)
+  auto targetFormat =
+      llvm::StringSwitch<std::optional<CompilationTarget>>(compilationTarget)
+          .Cases("offloading", "llvm", CompilationTarget::Offload)
+          .Cases("assembly", "isa", CompilationTarget::Assembly)
+          .Cases("binary", "bin", CompilationTarget::Binary)
+          .Cases("fatbinary", "fatbin", CompilationTarget::Fatbin)
+          .Default(std::nullopt);
+  if (!targetFormat)
     getOperation()->emitError() << "Invalid format specified.";
 
   // Lazy symbol table builder callback.
@@ -82,10 +82,8 @@ void GpuModuleToBinaryPass::runOnOperation() {
     return &parentTable.value();
   };
 
-  TargetOptions targetOptions(
-      toolkitPath, linkFiles, cmdOptions,
-      static_cast<TargetOptions::CompilationTarget>(targetFormat),
-      lazyTableBuilder);
+  TargetOptions targetOptions(toolkitPath, linkFiles, cmdOptions, *targetFormat,
+                              lazyTableBuilder);
   if (failed(transformGpuModulesToBinaries(
           getOperation(),
           offloadingHandler ? dyn_cast<OffloadingLLVMTranslationAttrInterface>(
@@ -107,17 +105,19 @@ LogicalResult moduleSerializer(GPUModuleOp op,
     auto target = dyn_cast<gpu::TargetAttrInterface>(targetAttr);
     assert(target &&
            "Target attribute doesn't implements TargetAttrInterface.");
-    std::optional<SmallVector<char, 0>> object =
+    std::optional<SmallVector<char, 0>> serializedModule =
         target.serializeToObject(op, targetOptions);
-    if (!object) {
+    if (!serializedModule) {
       op.emitError("An error happened while serializing the module.");
       return failure();
     }
-    objects.push_back(builder.getAttr<gpu::ObjectAttr>(
-        target,
-        builder.getStringAttr(StringRef(object->data(), object->size()))));
+    Attribute object = target.createObject(*serializedModule, targetOptions);
+    if (!object) {
+      op.emitError("An error happened while creating the object.");
+      return failure();
+    }
+    objects.push_back(object);
   }
   builder.setInsertionPointAfter(op);
   builder.create<gpu::BinaryOp>(op.getLoc(), op.getName(), handler,
diff --git a/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp b/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
index 7bf6804902479a8..d19d473a5327627 100644
--- a/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
+++ b/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
@@ -126,6 +126,27 @@ extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoad(void *data) {
   return module;
 }
 
+extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoadJIT(void *data,
+                                                                int optLevel) {
+  ScopedContext scopedContext;
+  CUmodule module = nullptr;
+  char jitErrorBuffer[4096] = {0};
+  CUjit_option jitOptions[] = {CU_JIT_ERROR_LOG_BUFFER,
+                               CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
+                               CU_JIT_OPTIMIZATION_LEVEL};
+  void *jitOptionsVals[] = {jitErrorBuffer,
+                            reinterpret_cast<void *>(sizeof(jitErrorBuffer)),
+                            reinterpret_cast<void *>(optLevel)};
+  CUresult result =
+      cuModuleLoadDataEx(&module, data, 3, jitOptions, jitOptionsVals);
+  if (result) {
+    fprintf(stderr, "JIT compilation failed with: '%s'\n", jitErrorBuffer);
+    CUDA_REPORT_IF_ERROR(result);
+  }
+  return module;
+}
+
 extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {
   CUDA_REPORT_IF_ERROR(cuModuleUnload(module));
 }
diff --git a/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp b/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp
index bd3868a8e196f6f..da2ae87fef6715f 100644
--- a/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp
+++ b/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp
@@ -38,6 +38,11 @@ extern "C" hipModule_t mgpuModuleLoad(void *data) {
   return module;
 }
 
+extern "C" hipModule_t mgpuModuleLoadJIT(void *data, int optLevel) {
+  assert(false && "This function is not available in HIP.");
+  return nullptr;
+}
+
 extern "C" void mgpuModuleUnload(hipModule_t module) {
   HIP_REPORT_IF_ERROR(hipModuleUnload(module));
 }
diff --git a/mlir/lib/Target/LLVM/NVVM/Target.cpp b/mlir/lib/Target/LLVM/NVVM/Target.cpp
index 13188b1107d928b..7f263627db54fbe 100644
--- a/mlir/lib/Target/LLVM/NVVM/Target.cpp
+++ b/mlir/lib/Target/LLVM/NVVM/Target.cpp
@@ -47,6 +47,10 @@ class NVVMTargetAttrImpl
   std::optional<SmallVector<char, 0>>
   serializeT...

aartbik (Contributor) commented Sep 13, 2023

Ad "An option needs to be added to the SparseCompiler to support the format option, however I didn't know if there's any preference."

Since I don't see changes in the sparse code, I assume you want some feedback, but I need a bit more context on what you had in mind. In general, we have a lot of "knobs" in the sparse pipeline setup, so generally I am not opposed to adding one more ;-)

fabianmcg (Contributor, Author) replied:

Since I don't see changes in the sparse code, I assume you want some feedback, but I need a bit more context on what you had in mind. In general, we have a lot of "knobs" in the sparse pipeline setup, so I am not opposed to adding one more ;-)

With this patch we have 3 ways to compile code: JIT (format=assembly), CUBIN (format=bin), and fatbin (format=fatbin). JIT will obviously add a performance hit at runtime, so the questions are:

  1. Is it okay to add another option to the sparse compiler to specify which format to use?
  2. Is there a preference for which option to use by default?

JIT will make the tests work, but fatbin is preferable for runtime performance, as it can be used for AOT. (A sketch of selecting each format through the pass pipeline follows below.)
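
For concreteness, here is a minimal sketch of selecting each of the three formats through the `format` option of `gpu-module-to-binary`. The option values come from the StringSwitch in `ModuleToBinary.cpp` above, while the `buildGpuCompilation` helper and the include paths are assumptions of this sketch:

```cpp
#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h" // declares parsePassPipeline (assumed)
#include "llvm/ADT/Twine.h"

// format is one of: "assembly"/"isa" (PTX, JITted by the driver at load),
// "binary"/"bin" (CUBIN for the exact chip), "fatbinary"/"fatbin" (CUBIN plus
// embedded PTX, the new default), or "offloading"/"llvm".
void buildGpuCompilation(mlir::OpPassManager &pm, llvm::StringRef format) {
  std::string pipeline =
      ("gpu-module-to-binary{format=" + format + "}").str();
  (void)mlir::parsePassPipeline(pipeline, pm); // error handling elided
}
```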

aartbik (Contributor) commented Sep 13, 2023

Ad "Is it okay to add another option to the sparse compiler to specify which format to use?"

Yes, more than okay!

Ass "Is there a preference to which option to use by default?"

If JIT make the test work again, let's make that the default. But please describe the three options in detail with performance implications (possibly indirectly by referring to where you add this as comment)

fabianmcg (Contributor, Author) commented Sep 13, 2023

If JIT makes the tests work again, let's make that the default. But please describe the three options in detail, with performance implications (possibly indirectly, by referring to where you add this as a comment).

I'll do that. Btw, did you have the chance to try the fix I posted in #65857 yesterday?

grypp (Member) commented Sep 13, 2023

I haven't looked at the code carefully, I will do that tomorrow, but adding JIT sounds great.

Is there a preference for which option to use by default?

Would it be okay if we didn't do this for default behaviour?

NVIDIA's state-of-the-art compiler is nvcc, and it uses ptxas, not JIT. When comparing the performance of MLIR, it is a useful sanity check to use the same toolchain as nvcc. I have had problems using the driver for JIT: it produced different SASS code from the exact same PTX, even though the driver was the same version as ptxas.

@fabianmcg fabianmcg requested a review from a team as a code owner September 13, 2023 20:44
@llvmbot llvmbot added the mlir:sparse Sparse compiler in MLIR label Sep 13, 2023
fabianmcg (Contributor, Author) replied:

Would it be okay if we didn't do this for default behaviour?

The default behavior remains fatbin. However, I made isa (JIT) the default behavior for the sparse compiler.

Another option is setting MLIR_GPU_COMPILATION_TEST_FORMAT=isa so that tests run with JIT, while keeping fatbin as the default behavior everywhere else.

I'm inclined to keep fatbin as the default everywhere and have downstream users set MLIR_GPU_COMPILATION_TEST_FORMAT=isa in their builds.

aartbik (Contributor) commented Sep 14, 2023

Perfectly okay with another default for the sparse compiler, for consistency. I was merely suggesting this so the tests would pass without changes, rather than explicitly marking them as isa. I don't feel strongly either way, so please pick whatever feels best.

GPU_ObjectISA,
GPU_ObjectBinary,
GPU_ObjectFatbin
]>;
Collaborator commented:
This deserves some doc.

(I'm not totally sure right now what "offload" does in this list actually)

fabianmcg (Contributor, Author) replied Sep 14, 2023:

I added the docs in the ObjectAttr docs. The offload format is meant to be a generic format; for NVPTX and AMDGPU it generates LLVM bitcode. Execution from this format is not enabled in trunk; however, downstream users could use it. (A sketch of requesting this format through the serialization interface follows below.)
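
For illustration, a minimal sketch of requesting the offload format through the serialization interface from this patch; the `serializeAsOffload` helper is an assumption of this sketch, the include is assumed sufficient, and the caller is assumed to already hold the target attribute and the gpu.module op:

```cpp
#include <optional>

#include "mlir/Dialect/GPU/IR/CompilationInterfaces.h" // assumed sufficient
#include "llvm/ADT/SmallVector.h"

// Hedged sketch: ask a target to serialize a gpu.module to the `offload`
// format (LLVM bitcode for NVPTX/AMDGPU, per the comment above). The helper
// name is illustrative; `target` and `module` come from the caller.
std::optional<llvm::SmallVector<char, 0>>
serializeAsOffload(mlir::gpu::TargetAttrInterface target,
                   mlir::Operation *module) {
  mlir::gpu::TargetOptions options(
      /*toolkitPath=*/{}, /*linkFiles=*/{}, /*cmdOptions=*/{},
      mlir::gpu::CompilationTarget::Offload);
  return target.serializeToObject(module, options);
}
```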

@@ -32,8 +44,17 @@ def GPU_ObjectAttr : GPU_Attr<"Object", "object"> {
#gpu.object<#nvvm.target, "...">
```
}];
Collaborator commented:

Update the doc, please.

binOrFatbin =
binary |
fatbinary, /// The process should produce a binary or fatbinary. It's up
/// to the target to decide which.
Collaborator commented:

(This is the doc that may have been lost moving to ODS, cf. the other comment above.)

@@ -144,6 +144,22 @@ struct SparseCompilerOptions
desc("GPU target architecture")};
PassOptions::Option<std::string> gpuFeatures{*this, "gpu-features",
desc("GPU target features")};
/// For NVIDIA GPUs there are 3 compilation format options:
/// 1. `isa`: the compiler generates PTX and the runtime JITs the PTX.
Collaborator suggested:
Suggested change
/// 1. `isa`: the compiler generates PTX and the runtime JITs the PTX.
/// 1. `isa`: the compiler generates PTX and the driver JITs the PTX.

/// GPU running the program.
/// Option 3 is the best compromise between options 1 & 2 as it can JIT in
/// case of an arch mismatch, however, it's only possible to JIT to a higher
/// CC than `gpuChip`.
Collaborator commented:

What is the CC target when using option 1?

To some extent there shouldn't be any difference between options 1 and 3?

fabianmcg (Contributor, Author) replied:

It's never specified; that's why gpu-to-cubin always worked: it always JITs to the running architecture.

If there's an architecture mismatch, then options 1 and 3 have the same performance hit; however, if the compiled architecture matches the running one, then option 3 behaves like option 2 and there's no performance hit. (A sketch of querying the running architecture follows below.)
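
To make the mismatch condition concrete, here is a minimal sketch, assuming the CUDA driver API, of querying the running device's compute capability; this is what determines whether the embedded CUBIN is used directly or the driver falls back to JITting the PTX. The `runningArch` helper is illustrative and error handling is elided:

```cpp
#include <cuda.h>

// Returns e.g. 80 on an sm_80 device. Comparing this against the CC that
// `gpuChip` was compiled for predicts whether the CUBIN inside a fatbin is
// used as-is or the driver JITs the embedded PTX instead.
int runningArch() {
  CUdevice device;
  int major = 0, minor = 0;
  cuInit(0);
  cuDeviceGet(&device, /*ordinal=*/0);
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                       device);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                       device);
  return major * 10 + minor;
}
```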

fabianmcg (Contributor, Author) commented Sep 14, 2023

The last commit updated the docs and migrated all tests to use MLIR_GPU_COMPILATION_TEST_FORMAT, so downstream users can simply set -DMLIR_GPU_COMPILATION_TEST_FORMAT=isa when building and all tests should work. If @aartbik or @grypp can verify this, it would be appreciated.
For the sake of consistency, I also made fatbin the default format everywhere, including the sparse compiler.

@fabianmcg fabianmcg requested a review from joker-eph September 14, 2023 14:18
/// 3. `fatbin`: generates a fat binary with a CUBIN object for `gpuChip` and
/// also embeds the PTX in the fat binary.
/// Notes:
/// Option 1 adds a significant runtime performance hit, however, tests are
Contributor commented:
Thank you for adding this detailed explanation.

aartbik (Contributor) left a review comment:

Thank you for this change! I have a few nits, but it is good to go once they are addressed, so I am approving this (for the sparse changes part).

/// Option 1 adds a significant runtime performance hit, however, tests are
/// more likely to pass with this option.
/// Option 2 is better for execution time as there is no JIT; however, the
/// program will fail if there's an arch mismatch between `gpuChip` and the
Contributor commented:

nit: can you please spell out "architecture"? (Unless it is NVIDIA convention to write it that way.)

@@ -1,2 +1,4 @@
if not config.enable_cuda_runner or not config.mlir_run_cuda_sm80_tests:
config.unsupported = True

config.substitutions.append(("%format", config.gpu_compilation_format))
Contributor commented:

Can we please use a slightly more specific name for this? format is very generic; how about at least gpu_format or so.

@fabianmcg fabianmcg merged commit 5093413 into llvm:main Sep 14, 2023
aartbik (Contributor) commented Sep 14, 2023

Just a random note, @fabianmcg, that I really appreciate your refactoring. The GPU "pipeline" for the sparse compiler still had a few rough edges, and you really smoothed these out! So, thanks!

fabianmcg (Contributor, Author):

Thank you, happy to help! Also thanks for all the feedback and testing!

@fabianmcg fabianmcg deleted the ptx-jit branch September 18, 2023 15:47
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023