Lower SLM to XeGPU #409
Conversation
pm.addNestedPass<func::FuncOp>(createLinalgToXeGPU(
    {/*kTile=*/16, /*stages=*/1, /*dpasTiles=*/{8, 16, 16}}));
pm.addPass(createCSEPass());
Added a CSE pass to minimize the impact of the insert/extract manipulations on vectors.
imex::InsertGPUAllocsOptions insertGPUAllocsOption{
    /*clientAPI*/ "opencl", /*inRegions*/ false,
    /*isUsmArgs*/ pipelineOpts.isUsmArgs};
pm.addNestedPass<func::FuncOp>(
    imex::createInsertGPUAllocsPass(insertGPUAllocsOption));
pm.addPass(createGpuKernelOutliningPass());
pm.addPass(createCanonicalizerPass());
The canonicalizer converts vector.from_elements [%val, %val, ... %val] into vector.splat %val, which causes the imex::ConvertGPUXToSPIRVPass to fail (it seems it doesn't support vector.splat). So I removed the canonicalizer.
Looks good, some questions and comments inlined
if (!sgMap) {
  // Assuming default tensor descriptor type (blocked & in global memory).
  return xegpu::TensorDescType::get(shape, elementType, /*array_length=*/1,
                                    /*boundary_check=*/true);
}
sgMap shouldn't have anything to do with the descriptor type.
There are two types of tensor attributes that are called sg_map in the implementation of the XeGPU dialect:
- ScatterTensorDescAttr - for scatter descriptors
- BlockTensorDescAttr - for block descriptors
They describe two kinds of descriptors (the type itself is indeed the same), and the kind depends on sg_map.
sg_map has nothing to do with the tensor descriptor attributes (they are not called sg_map); it is a separate attribute that describes the data chunks accessed by individual threads within a subgroup.
ah, okay :)
renamed sgMap -> descAttr
@@ -150,5 +151,47 @@ std::pair<Value, Value> getPtrAndOffset(OpBuilder &builder, Value operand) {
  return std::make_pair(alignedPointer, offset);
}
Value flattenMemref(PatternRewriter &rewriter, Location loc, Value srcMemref) { |
I think I saw something very similar in LowerQuantOps.cpp. Maybe reuse is possible.
I think I saw something very similar in LowerQuantOps.cpp
We don't have this file in our project. What are you referring to?
Ah, okay, found it in LLVM.
They flatten tensors there, not memrefs
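For intuition, flattening a memref before building scatter descriptors amounts to linearizing indices into a 1D view of the same memory. A minimal Python sketch of that row-major linearization (helper names are illustrative, not from the actual pass):

```python
def flatten_index(row, col, ncols):
    """Linearize a (row, col) index into a flat 1D offset (row-major)."""
    return row * ncols + col

def flatten(buf2d):
    """Flatten a row-major 2D list into 1D, i.e. view the same data as 1D."""
    return [x for row in buf2d for x in row]

buf = [[0, 1, 2], [3, 4, 5]]
flat = flatten(buf)
# The flat view at the linearized index matches the 2D element.
assert flat[flatten_index(1, 2, ncols=3)] == buf[1][2]
```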
assert(llvm::all_of(storeTiles,
                    [&](Value tile) { return tile.getType() == tileType; }) &&
       "All store tiles must have the same type.");
assert(tileType.getShape().size() == 1 && "Scatter tiles must be 1D");
Is this also coming from lowering restrictions?
I would say it's an xegpu limitation: SLM for f16 can only be loaded/stored via 1D scatter descriptors.
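Given the 32-elements-per-load limit mentioned in the PR description, a tile's flat element range has to be covered by fixed-size scatter chunks. A hypothetical sketch of that chunking (the 32-element constant comes from the issue; the helper name is illustrative):

```python
CHUNK = 32  # per-access element limit for SLM scatter loads/stores (from the issue)

def scatter_chunks(num_elements, chunk=CHUNK):
    """Return (offset, length) pairs covering num_elements in <=chunk pieces."""
    return [(off, min(chunk, num_elements - off))
            for off in range(0, num_elements, chunk)]

# A 16x16 f16 tile has 256 elements -> 8 scatter accesses of 32 elements each.
chunks = scatter_chunks(16 * 16)
assert len(chunks) == 8 and all(n == 32 for _, n in chunks)
```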
// Do we need those for SLM?
/*l1_hint=*/hint, /*l2_hint=*/hint, /*l3_hint=*/hint);
not sure, will need to double-check
Well, if nothing crashes with them, I think we can keep them :D
// The shape to be loaded is split into the largest 2D loads supported
// by the hardware.
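The splitting the comment describes can be sketched as tiling a 2D shape into the largest hardware-supported blocks. A minimal Python sketch; the 8x16 maximum is an illustrative assumption, not the pass's actual hardware query:

```python
MAX_TILE = (8, 16)  # assumed max 2D block load shape, for illustration only

def block_tiles(rows, cols, max_tile=MAX_TILE):
    """Yield (row_off, col_off, height, width) tiles covering rows x cols."""
    th, tw = max_tile
    return [(r, c, min(th, rows - r), min(tw, cols - c))
            for r in range(0, rows, th)
            for c in range(0, cols, tw)]

tiles = block_tiles(16, 32)  # 2 x 2 = 4 full 8x16 tiles
assert len(tiles) == 4
```

Edge tiles simply get clamped (`min(...)`), so shapes that aren't multiples of the max tile still get full coverage.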
what happens to, say, 1d tensors?
I don't know. I would assume it will crash in the exact same way as the current linalg-to-xegpu lowering does. Here's an attempt to use the linalg-to-xegpu pass with 1D tensors/memrefs on the current main branch:
gc-opt: /home/jovyan/llvm/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp:83: static void mlir::xegpu::CreateNdDescOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::Type, mlir::TypedValue<mlir::MemRefType>, llvm::ArrayRef<mlir::OpFoldResult>): Assertion `ty.hasStaticShape() && offsets.size() == (size_t)ty.getRank()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: ./bin/gc-opt /home/jovyan/graph-compiler/test/mlir/test/gc/Transforms/GPU/linalg-to-xegpu1d.mlir "-linalg-to-xegpu=dpas-tile=8,16,16 k-tile=16" -canonicalize -split-input-file
#0 0x00005571ec59bb30 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (./bin/gc-opt+0x589bb30)
#1 0x00005571ec598f3f llvm::sys::RunSignalHandlers() (./bin/gc-opt+0x5898f3f)
#2 0x00005571ec599095 SignalHandler(int) Signals.cpp:0:0
#3 0x00007fd97d43f520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007fd97d4939fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
#5 0x00007fd97d4939fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
#6 0x00007fd97d4939fc pthread_kill ./nptl/pthread_kill.c:89:10
#7 0x00007fd97d43f476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
#8 0x00007fd97d4257f3 abort ./stdlib/abort.c:81:7
#9 0x00007fd97d42571b _nl_load_domain ./intl/loadmsgcat.c:1177:9
#10 0x00007fd97d436e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#11 0x00005571e931f149 mlir::xegpu::CreateNdDescOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::Type, mlir::detail::TypedValue<mlir::MemRefType>, llvm::ArrayRef<mlir::OpFoldResult>) (./bin/gc-opt+0x261f149)
#12 0x00005571e9a4de94 mlir::xegpu::CreateNdDescOp mlir::OpBuilder::create<mlir::xegpu::CreateNdDescOp, mlir::xegpu::TensorDescType&, mlir::detail::TypedValue<mlir::MemRefType>, llvm::SmallVector<mlir::OpFoldResult, 6u>&>(mlir::Location, mlir::xegpu::TensorDescType&, mlir::detail::TypedValue<mlir::MemRefType>&&, llvm::SmallVector<mlir::OpFoldResult, 6u>&) /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/IR/Builders.h:517:22
#13 0x00005571e9a2af4e (anonymous namespace)::createDescriptorTiles(mlir::PatternRewriter&, mlir::Location, mlir::Value, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, int, bool) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:661:0
#14 0x00005571e9a2b585 (anonymous namespace)::createCoarseDscTiles(mlir::PatternRewriter&, mlir::Location, mlir::Value, llvm::ArrayRef<long>, bool, bool) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:735:0
#15 0x00005571e9a2fa31 (anonymous namespace)::createEltwiseKernel(mlir::linalg::LinalgOp, mlir::PatternRewriter&) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1321:0
#16 0x00005571e9a42f80 (anonymous namespace)::ConvertNamedEltwiseToXeGPU<mlir::linalg::AddOp>::matchAndRewrite(mlir::linalg::AddOp, mlir::PatternRewriter&) const /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1494:0
#17 0x00005571e9a6590c mlir::detail::OpOrInterfaceRewritePatternBase<mlir::linalg::AddOp>::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&) const /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/IR/PatternMatch.h:332:3
#18 0x00005571ec0c6bc8 mlir::PatternApplicator::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&, llvm::function_ref<bool (mlir::Pattern const&)>, llvm::function_ref<void (mlir::Pattern const&)>, llvm::function_ref<llvm::LogicalResult (mlir::Pattern const&)>) (./bin/gc-opt+0x53c6bc8)
#19 0x00005571ec08f3de (anonymous namespace)::GreedyPatternRewriteDriver::processWorklist() GreedyPatternRewriteDriver.cpp:0:0
#20 0x00005571ec091be5 mlir::applyPatternsAndFoldGreedily(mlir::Region&, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig, bool*) (./bin/gc-opt+0x5391be5)
#21 0x00005571e9915394 mlir::applyPatternsAndFoldGreedily(mlir::Operation*, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig, bool*) /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/Transforms/GreedyPatternRewriteDriver.h:159:37
#22 0x00005571e9a30f6e (anonymous namespace)::LinalgToXeGPU::runOnOperation() /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1649:0
#23 0x00005571ec1c3479 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (./bin/gc-opt+0x54c3479)
#24 0x00005571ec1c3931 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (./bin/gc-opt+0x54c3931)
#25 0x00005571ec1c3cd6 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::'lambda'(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&)::operator()(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&) const Pass.cpp:0:0
#26 0x00005571ec1c29a5 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) (./bin/gc-opt+0x54c29a5)
#27 0x00005571ec1c3280 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (./bin/gc-opt+0x54c3280)
#28 0x00005571ec1c3931 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (./bin/gc-opt+0x54c3931)
#29 0x00005571ec1c4995 mlir::PassManager::run(mlir::Operation*) (./bin/gc-opt+0x54c4995)
#30 0x00005571e98c5217 performActions(llvm::raw_ostream&, std::shared_ptr<llvm::SourceMgr> const&, mlir::MLIRContext*, mlir::MlirOptMainConfig const&) MlirOptMain.cpp:0:0
#31 0x00005571e98c5c2c processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::MlirOptMainConfig const&, mlir::DialectRegistry&, llvm::ThreadPoolInterface*) MlirOptMain.cpp:0:0
#32 0x00005571e98c5d8d llvm::LogicalResult llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>::callback_fn<mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&)::'lambda'(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>(long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&) MlirOptMain.cpp:0:0
#33 0x00005571ec467b1f mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, llvm::StringRef, llvm::StringRef)::'lambda'(llvm::StringRef)::operator()(llvm::StringRef) const ToolUtilities.cpp:0:0
#34 0x00005571ec468472 mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, llvm::StringRef, llvm::StringRef) (./bin/gc-opt+0x5768472)
#35 0x00005571e98bd56c mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&) (./bin/gc-opt+0x2bbd56c)
#36 0x00005571e98c5ef0 mlir::MlirOptMain(int, char**, llvm::StringRef, llvm::StringRef, mlir::DialectRegistry&) (./bin/gc-opt+0x2bc5ef0)
#37 0x00005571e98c6417 mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&) (./bin/gc-opt+0x2bc6417)
#38 0x00005571e70d410c main /home/jovyan/graph-compiler/src/gc-opt/gc-opt.cpp:75:0
#39 0x00007fd97d426d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#40 0x00007fd97d426e40 call_init ./csu/../csu/libc-start.c:128:20
#41 0x00007fd97d426e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#42 0x00005571e70d3cf5 _start (./bin/gc-opt+0x3d3cf5)
Aborted (core dumped)
(1D is not supported by linalg-to-xegpu)
Fixes #394. The PR adds SLM support to the linalg-to-xegpu pass.

SLM requires special handling in XeGPU: we can only load it via scatter descriptors, and only 32 elements per load (more info in the issue). The flow of working with SLM in XeGPU is the following:

Creating descriptors
1. Flatten the SLM memref with memref.reinterpret_cast, since scatter descriptors only work with 1D memrefs.
2. Since the imex::ConvertGPUXToSPIRV pass doesn't allow memref.subviews inside a GPU kernel, we have to merge subview offsets with the offsets for the root xegpu descriptor. So step 2 is basically to compute offsets for the beginning of the SLM block for this thread.

Do we merge `subview` offsets for block descriptors as well?

Yes. There's a separate pass upstream (XeGPUFoldAliasOps) that does it. It only works with block descriptors though, meaning that for scattered ones we have to implement the logic on our own.

MLIR example
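The offset merging described above can be sketched as folding a 2D subview origin into the flat 1D descriptor offset, assuming a row-major parent that has already been flattened (names and numbers are illustrative):

```python
def merged_flat_offset(base_off, subview_off, parent_cols):
    """Fold a 2D subview origin (row, col) into a flat 1D descriptor offset."""
    row, col = subview_off
    return base_off + row * parent_cols + col

# This thread's SLM block starts 64 elements into the flat buffer; the subview
# begins at (2, 4) of a parent with 16 columns -> flat offset 64 + 2*16 + 4.
assert merged_flat_offset(64, (2, 4), 16) == 100
```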
Loading data
MLIR example
Storing data
MLIR example
As you can see, a lot of effort is required to load/store tiles from SLM. Even loading/storing a single 16x16 block requires 8 loads + 8 vector.insert ops + 8 stores + 8 vector.extract_strided_slice ops. It seems this won't perform very well, and we should avoid using SLM where possible (through op fusion, for example).
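The op counts above follow directly from the 32-element access limit. A small sketch of that arithmetic (a back-of-the-envelope model, not the pass's actual cost logic):

```python
def slm_tile_op_count(rows, cols, elems_per_access=32):
    """Ops needed to round-trip a tile through SLM via 1D scatter accesses."""
    n_accesses = (rows * cols) // elems_per_access
    return {
        "loads": n_accesses,
        "vector.insert": n_accesses,                  # assemble loaded 1D chunks
        "stores": n_accesses,
        "vector.extract_strided_slice": n_accesses,   # slice chunks to store
    }

# A 16x16 tile: 256 elements / 32 per access -> 8 of each op.
counts = slm_tile_op_count(16, 16)
assert all(v == 8 for v in counts.values())
```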