Lower SLM to XeGPU #409


Merged: 7 commits merged into intel:main on Nov 15, 2024

Conversation

@dchigarev (Contributor) commented on Nov 13, 2024

Fixes #394. This PR adds SLM (shared local memory) support to the linalg-to-xegpu pass.

SLM requires special handling in XeGPU: it can only be loaded via scatter descriptors, and only 32 elements per load (more info in the issue).

The flow for working with SLM in XeGPU is the following:

Creating descriptors

  1. Flatten the 2D memref via memref.reinterpret_cast, since scatter descriptors only work with 1D memrefs.
  2. Since the imex::ConvertGPUXToSPIRV pass doesn't allow memref.subview inside a GPU kernel, we have to merge the subview offsets into the offsets of the root xegpu descriptor. Step 2 therefore computes the offset of the beginning of this thread's SLM block.
    Do we merge `subview` offsets for block descriptors as well? Yes. There's a separate upstream pass (XeGPUFoldAliasOps) that does it. It only works with blocked descriptors though, so for scattered ones we have to implement the logic ourselves (a rough sketch of that folding is shown right after this list).
  3. Compute the offsets for each load, taking row and column tiles into account. The descriptors are returned in row-major tile order (see the note at the end of the example below).
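
For reference, a rough sketch of the blocked-descriptor folding performed by the upstream XeGPUFoldAliasOps pass (op syntax and types are illustrative only; the scattered equivalent implemented in this PR is shown in the example below):

// before folding: the descriptor is created from a subview
%sub = memref.subview %buf[%r, %c] [16, 32] [1, 1]
    : memref<32x64xf16> to memref<16x32xf16, strided<[64, 1], offset: ?>>
%desc = xegpu.create_nd_tdesc %sub[0, 0]
    : memref<16x32xf16, strided<[64, 1], offset: ?>> -> !xegpu.tensor_desc<8x16xf16>

// after folding: the subview offsets are merged into the descriptor offsets
%desc = xegpu.create_nd_tdesc %buf[%r, %c]
    : memref<32x64xf16> -> !xegpu.tensor_desc<8x16xf16>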
MLIR example
// thread chunk size = [16, 32]
// num_threads_x = 2; num_threads_y = 2
%slm_row_idx = %thread_idx_x * 16
%slm_col_idx = %thread_idx_y * 32

%slm_buff = memref.alloc() : memref<32x64, 3>  // 32 = 16 * %num_threads_x, 64 = 32 * %num_threads_y
%slm_chunk = memref.subview %slm_buff[%slm_row_idx, %slm_col_idx] : memref<32x64, 3> to memref<16x32, 3>

// want to load four 8x16 tiles
// createSLMDescTiles(src=%slm_chunk, loadShape=[16, 32], descTile=[8, 16]) produces:

%slm_flat = memref.reinterpret_cast %slm_buff : memref<32x64, 3> to memref<2048, 3>
%slm_offset = %slm_row_idx * 64 + %slm_col_idx

// createSLMDescTiles then calls
// createScatterDescriptorTiles(flatMemref=%slm_flat, loadShape2D=[16, 32], tileSize2D=[8, 16],
//                              memrefStrides=[64, 1], blockOffset=%slm_offset) that produces:

// This indicates how many rows of a single tile (defined by tileSize2D) are loaded
// per single load operation (single load loads exactly 32 elements).
%numRowsPerLoad = 32 / %tileSize2D[1] = 32 / 16 = 2
// This indicates the offset between two loads
%offset_per_load = %rowStride * %numRowsPerLoad = 64 * 2 = 128

// col tile 0
// Load offsets for colTile0
%offsetShiftValues_colTile_0 = [
            /*first-row*/ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
            /*second-row*/64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] : vector<32xindex>
                          ^--- stride

// first 8x16 tile [rowTileIdx=0, colTileIdx=0]
%desc0 = xegpu.create_descriptor %slm_flat, %slm_offset + %offsetShiftValues_colTile_0 : xegpu.descriptor<32xf16>
%desc1 = xegpu.update_descriptor %desc0, %offset_per_load
%desc2 = xegpu.update_descriptor %desc1, %offset_per_load
%desc3 = xegpu.update_descriptor %desc2, %offset_per_load

// second 8x16 tile [rowTileIdx=1, colTileIdx=0]
%desc4 = xegpu.update_descriptor %desc3, %offset_per_load
%desc5 = xegpu.update_descriptor %desc4, %offset_per_load
%desc6 = xegpu.update_descriptor %desc5, %offset_per_load
%desc7 = xegpu.update_descriptor %desc6, %offset_per_load

// col tile 1
// Load offsets for colTile1
%offsetShiftValues_colTile_1 = [
            /*first-row*/ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
            /*second-row*/80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95] : vector<32xindex>

// third 8x16 tile [rowTileIdx=0, colTileIdx=1]
%desc8 = xegpu.create_descriptor %slm_flat, %slm_offset + %offsetShiftValues_colTile_1 : xegpu.descriptor<32xf16>
%desc9 = xegpu.update_descriptor %desc8, %offset_per_load
%desc10 = xegpu.update_descriptor %desc9, %offset_per_load
%desc11 = xegpu.update_descriptor %desc10, %offset_per_load

// fourth 8x16 tile [rowTileIdx=1, colTileIdx=1]
%desc12 = xegpu.update_descriptor %desc11, %offset_per_load
%desc13 = xegpu.update_descriptor %desc12, %offset_per_load
%desc14 = xegpu.update_descriptor %desc13, %offset_per_load
%desc15 = xegpu.update_descriptor %desc14, %offset_per_load

// createSLMDescTiles() returns [
//      %desc0, %desc1, %desc2, %desc3, %desc8, %desc9, %desc10, %desc11,
//      %desc4, %desc5, %desc6, %desc7, %desc12, %desc13, %desc14, %desc15
//]     ^-- note that tiles are returned in row-major order to match the logic of
//         blocked descriptors ([rowTileIdx=0, colTileIdx=0], [rowTileIdx=0, colTileIdx=1], [rowTileIdx=1, colTileIdx=0], [rowTileIdx=1, colTileIdx=1])

Loading data

  1. Since we can only load 32 elements per load, we create an accumulator vector whose total number of elements equals a single tile (for an 8x16 tile that is 4x32).
  2. We then sequentially load the elements of a single tile and insert them into the accumulator vector.
  3. Once the vector is full, we cast it to the tile shape (4x32 -> 8x16).
  4. Repeat the process until all tiles are loaded.
MLIR example
////// LOAD

// loadScatterDescTiles(loadTiles=[desc0-desc3, desc8-desc11, desc4-desc7, desc12-desc15], tileShape=[8, 16]) produces

// vector that will store a single 8x16 tile
%flatAccum = arith.constant dense<0.0> : vector<128xf16>  // 8*16 = 128
// each row in 'accum' is a single load
%accum = vector.shape_cast %flatAccum : vector<128xf16> to vector<4x32xf16>

// load first 8x16 tile ([rowTileIdx=0, colTileIdx=0])
%load0 = xegpu.load %desc0, %mask : vector<32xf16>
%accum = vector.insert %load0, %accum[0] : vector<4x32xf16>

%load1 = xegpu.load %desc1,      %mask : vector<32xf16>
%accum = vector.insert %load1, %accum[1] : vector<4x32xf16>

%load2 = xegpu.load %desc2,      %mask : vector<32xf16>
%accum = vector.insert %load2, %accum[2] : vector<4x32xf16>

%load3 = xegpu.load %desc3,      %mask : vector<32xf16>
%accum = vector.insert %load3, %accum[3] : vector<4x32xf16>

%accum = vector.shape_cast %accum : vector<4x32xf16> to vector<128xf16>
%loadedTile0 = vector.shape_cast %accum : vector<128xf16> to vector<8x16xf16>

// load second 8x16 tile ([rowTileIdx=0, colTileIdx=1])
%load4 = xegpu.load %desc8, %mask : vector<32xf16>
%accum = vector.insert %load4, %accum[0] : vector<4x32xf16>

%load5 = xegpu.load %desc9,      %mask : vector<32xf16>
%accum = vector.insert %load5, %accum[1] : vector<4x32xf16>

%load6 = xegpu.load %desc10,      %mask : vector<32xf16>
%accum = vector.insert %load6, %accum[2] : vector<4x32xf16>

%load7 = xegpu.load %desc11,      %mask : vector<32xf16>
%accum = vector.insert %load7, %accum[3] : vector<4x32xf16>

%accum = vector.shape_cast %accum : vector<4x32xf16> to vector<128xf16>
%loadedTile1 = vector.shape_cast %accum : vector<128xf16> to vector<8x16xf16>

// load third 8x16 tile ([rowTileIdx=1, colTileIdx=0])
%load8 = xegpu.load %desc4, %mask : vector<32xf16>
%accum = vector.insert %load8, %accum[0] : vector<4x32xf16>

%load9 = xegpu.load %desc5,      %mask : vector<32xf16>
%accum = vector.insert %load9, %accum[1] : vector<4x32xf16>

%load10 = xegpu.load %desc6,      %mask : vector<32xf16>
%accum = vector.insert %load10, %accum[2] : vector<4x32xf16>

%load11 = xegpu.load %desc7,      %mask : vector<32xf16>
%accum = vector.insert %load11, %accum[3] : vector<4x32xf16>

%accum = vector.shape_cast %accum : vector<4x32xf16> to vector<128xf16>
%loadedTile2 = vector.shape_cast %accum : vector<128xf16> to vector<8x16xf16>

// load fourth 8x16 tile ([rowTileIdx=1, colTileIdx=1])
%load12 = xegpu.load %desc12, %mask : vector<32xf16>
%accum = vector.insert %load12, %accum[0] : vector<4x32xf16>

%load13 = xegpu.load %desc13,      %mask : vector<32xf16>
%accum = vector.insert %load13, %accum[1] : vector<4x32xf16>

%load14 = xegpu.load %desc14,      %mask : vector<32xf16>
%accum = vector.insert %load14, %accum[2] : vector<4x32xf16>

%load15 = xegpu.load %desc15,      %mask : vector<32xf16>
%accum = vector.insert %load15, %accum[3] : vector<4x32xf16>

%accum = vector.shape_cast %accum : vector<4x32xf16> to vector<128xf16>
%loadedTile3 = vector.shape_cast %accum : vector<128xf16> to vector<8x16xf16>

// loadScatterDescTiles() returns [%loadedTile0, %loadedTile1, %loadedTile2, %loadedTile3]

Storing data

  1. Since we can only store 32 elements per store, we first flatten each vector tile (8x16 -> 128).
  2. Then we extract 32-element slices from the flattened vector and store them.
MLIR example
////// STORE

// storeScatterDescTiles(results=[%loadedTile0, %loadedTile1, %loadedTile2, %loadedTile3],
//                       storeTiles=[desc0-desc3, desc8-desc11, desc4-desc7, desc12-desc15]) produces

// store first 8x16 tile ([rowTileIdx=0, colTileIdx=0])
%flatResult0 = vector.shape_cast %loadedTile0 : vector<8x16xf16> to vector<128xf16>

%store0 = vector.extract_strided_slice %flatResult0 : offset = [0], size = [32] -> vector<32xf16>
xegpu.store %store0, %desc0

%store1 = vector.extract_strided_slice %flatResult0 : offset = [32], size = [32] -> vector<32xf16>
xegpu.store %store1, %desc1

%store2 = vector.extract_strided_slice %flatResult0 : offset = [64], size = [32] -> vector<32xf16>
xegpu.store %store2, %desc2

%store3 = vector.extract_strided_slice %flatResult0 : offset = [96], size = [32] -> vector<32xf16>
xegpu.store %store3, %desc3

// store second 8x16 tile ([rowTileIdx=0, colTileIdx=1])
%flatResult1 = vector.shape_cast %loadedTile1 : vector<8x16xf16> to vector<128xf16>

%store4 = vector.extract_strided_slice %flatResult1 : offset = [0], size = [32] -> vector<32xf16>
xegpu.store %store4, %desc8

%store5 = vector.extract_strided_slice %flatResult1 : offset = [32], size = [32] -> vector<32xf16>
xegpu.store %store5, %desc9

%store6 = vector.extract_strided_slice %flatResult1 : offset = [64], size = [32] -> vector<32xf16>
xegpu.store %store6, %desc10

%store7 = vector.extract_strided_slice %flatResult1 : offset = [96], size = [32] -> vector<32xf16>
xegpu.store %store7, %desc11

// store third 8x16 tile ([rowTileIdx=1, colTileIdx=0])
%flatResult2 = vector.shape_cast %loadedTile2 : vector<8x16xf16> to vector<128xf16>

%store8 = vector.extract_strided_slice %flatResult2 : offset = [0], size = [32] -> vector<32xf16>
xegpu.store %store8, %desc4

%store9 = vector.extract_strided_slice %flatResult2 : offset = [32], size = [32] -> vector<32xf16>
xegpu.store %store9, %desc5

%store10 = vector.extract_strided_slice %flatResult2 : offset = [64], size = [32] -> vector<32xf16>
xegpu.store %store10, %desc6

%store11 = vector.extract_strided_slice %flatResult2 : offset = [96], size = [32] -> vector<32xf16>
xegpu.store %store11, %desc7

// store fourth 8x16 tile ([rowTileIdx=1, colTileIdx=1])
%flatResult3 = vector.shape_cast %loadedTile3 : vector<8x16xf16> to vector<128xf16>

%store12 = vector.extract_strided_slice %flatResult3 : offset = [0], size = [32] -> vector<32xf16>
xegpu.store %store12, %desc12

%store13 = vector.extract_strided_slice %flatResult3 : offset = [32], size = [32] -> vector<32xf16>
xegpu.store %store13, %desc13

%store14 = vector.extract_strided_slice %flatResult3 : offset = [64], size = [32] -> vector<32xf16>
xegpu.store %store14, %desc14

%store15 = vector.extract_strided_slice %flatResult3 : offset = [96], size = [32] -> vector<32xf16>
xegpu.store %store15, %desc15

As you can see, a lot of effort is required to load/store tiles from SLM. Even loading/storing a single 16x16 block (256 elements, i.e. 256 / 32 = 8 messages in each direction) requires 8 loads + 8 vector.insert ops + 8 stores + 8 vector.extract_strided_slice ops. It seems this won't perform very well, so we should avoid using SLM where possible (through op fusion, for example).

pm.addNestedPass<func::FuncOp>(createLinalgToXeGPU(
    {/*kTile=*/16, /*stages=*/1, /*dpasTiles=*/{8, 16, 16}}));
pm.addPass(createCSEPass());
dchigarev (Contributor, Author):

added a CSE pass to minimize the impact of the insert/extract manipulations with vectors
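
As a rough illustration (the value names are made up, not taken from the actual lowering output), CSE collapses the identical mask constants that every scattered load would otherwise re-materialize:

// before CSE: each load materializes its own identical mask constant
%mask0 = arith.constant dense<true> : vector<32xi1>
%load0 = xegpu.load %desc0, %mask0 : vector<32xf16>
%mask1 = arith.constant dense<true> : vector<32xi1>
%load1 = xegpu.load %desc1, %mask1 : vector<32xf16>

// after CSE: the duplicates are folded into a single definition
%mask = arith.constant dense<true> : vector<32xi1>
%load0 = xegpu.load %desc0, %mask : vector<32xf16>
%load1 = xegpu.load %desc1, %mask : vector<32xf16>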


imex::InsertGPUAllocsOptions insertGPUAllocsOption{
    /*clientAPI*/ "opencl", /*inRegions*/ false,
    /*isUsmArgs*/ pipelineOpts.isUsmArgs};
pm.addNestedPass<func::FuncOp>(
    imex::createInsertGPUAllocsPass(insertGPUAllocsOption));
pm.addPass(createGpuKernelOutliningPass());
pm.addPass(createCanonicalizerPass());
dchigarev (Contributor, Author):

The canonicalizer converts vector.from_elements [%val, %val, ... %val] into vector.splat %val, which causes the imex::ConvertGPUXToSPIRVPass to fail (it seems it doesn't support vector.splat). So the canonicalizer was removed.
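
For illustration, a minimal sketch of the rewrite in question (the vector size is arbitrary):

// before canonicalization
%v = vector.from_elements %val, %val, %val, %val : vector<4xf16>
// after canonicalization: folded into a splat, which the SPIRV conversion did not handle
%v = vector.splat %val : vector<4xf16>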

@dchigarev marked this pull request as ready for review on November 14, 2024, 12:07
@kurapov-peter (Contributor) left a comment:

Looks good, some questions and comments inlined

Comment on lines 103 to 107
if (!sgMap) {
  // Assuming default tensor descriptor type (blocked & in global memory).
  return xegpu::TensorDescType::get(shape, elementType, /*array_length=*/1,
                                    /*boundary_check=*/true);
}
Contributor:

sgMap shouldn't have anything to do with the type of the descriptor.

dchigarev (Contributor, Author):

There are two kinds of tensor descriptor attributes that are called sg_map in the implementation of the XeGPU dialect:

  1. ScatterTensorDescAttr - for scatter descriptors
  2. BlockTensorDescAttr - for block descriptors

They describe two kinds of descriptors (the type is indeed the same), and the kind depends on sg_map.

Contributor:

sg_map has nothing to do with the tensor descriptor attributes (they are not called sg_map); it is a separate attribute that describes the data chunks accessed by individual threads within a subgroup.

dchigarev (Contributor, Author):

ah, okay :)

renamed sgMap -> descAttr

@@ -150,5 +151,47 @@ std::pair<Value, Value> getPtrAndOffset(OpBuilder &builder, Value operand) {
return std::make_pair(alignedPointer, offset);
}

Value flattenMemref(PatternRewriter &rewriter, Location loc, Value srcMemref) {
Contributor:

I think I saw something very similar in LowerQuantOps.cpp. Maybe reuse is possible.

dchigarev (Contributor, Author):

I think I saw something very similar in LowerQuantOps.cpp

We don't have this file in our project. What are you referring to?

dchigarev (Contributor, Author):

Ah, okay, found it in LLVM.

They flatten tensors there, not memrefs.

assert(llvm::all_of(storeTiles,
                    [&](Value tile) { return tile.getType() == tileType; }) &&
       "All load tiles must have the same type.");
assert(tileType.getShape().size() == 1 && "Scatter tiles must be 1D");
Contributor:

Is this also coming from lowering restrictions?

dchigarev (Contributor, Author):

I would say it's an XeGPU limitation. SLM for f16 can only be loaded/stored via 1D scatter descriptors.

Comment on lines +1045 to +1046
// Do we need those for SLM?
/*l1_hint=*/hint, /*l2_hint=*/hint, /*l3_hint=*/hint);
Contributor:

Not sure, will need to double-check.

dchigarev (Contributor, Author):

Well, if nothing crashes with them, I think we can keep them :D

Comment on lines +912 to +913
// The shape to be loaded is split into the largest 2D loads supported
// by the hardware.
Contributor:

What happens to, say, 1D tensors?

dchigarev (Contributor, Author):

I don't know. I would assume it will crash in the exact same way as the current linalg-to-xegpu lowering does.

An attempt to use the linalg-to-xegpu pass with 1D tensors/memrefs on the current main branch:
gc-opt: /home/jovyan/llvm/mlir/lib/Dialect/XeGPU/IR/XeGPUOps.cpp:83: static void mlir::xegpu::CreateNdDescOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::Type, mlir::TypedValue<mlir::MemRefType>, llvm::ArrayRef<mlir::OpFoldResult>): Assertion `ty.hasStaticShape() && offsets.size() == (size_t)ty.getRank()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./bin/gc-opt /home/jovyan/graph-compiler/test/mlir/test/gc/Transforms/GPU/linalg-to-xegpu1d.mlir "-linalg-to-xegpu=dpas-tile=8,16,16 k-tile=16" -canonicalize -split-input-file
 #0 0x00005571ec59bb30 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (./bin/gc-opt+0x589bb30)
 #1 0x00005571ec598f3f llvm::sys::RunSignalHandlers() (./bin/gc-opt+0x5898f3f)
 #2 0x00005571ec599095 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007fd97d43f520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007fd97d4939fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007fd97d4939fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #6 0x00007fd97d4939fc pthread_kill ./nptl/pthread_kill.c:89:10
 #7 0x00007fd97d43f476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007fd97d4257f3 abort ./stdlib/abort.c:81:7
 #9 0x00007fd97d42571b _nl_load_domain ./intl/loadmsgcat.c:1177:9
#10 0x00007fd97d436e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#11 0x00005571e931f149 mlir::xegpu::CreateNdDescOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::Type, mlir::detail::TypedValue<mlir::MemRefType>, llvm::ArrayRef<mlir::OpFoldResult>) (./bin/gc-opt+0x261f149)
#12 0x00005571e9a4de94 mlir::xegpu::CreateNdDescOp mlir::OpBuilder::create<mlir::xegpu::CreateNdDescOp, mlir::xegpu::TensorDescType&, mlir::detail::TypedValue<mlir::MemRefType>, llvm::SmallVector<mlir::OpFoldResult, 6u>&>(mlir::Location, mlir::xegpu::TensorDescType&, mlir::detail::TypedValue<mlir::MemRefType>&&, llvm::SmallVector<mlir::OpFoldResult, 6u>&) /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/IR/Builders.h:517:22
#13 0x00005571e9a2af4e (anonymous namespace)::createDescriptorTiles(mlir::PatternRewriter&, mlir::Location, mlir::Value, llvm::ArrayRef<long>, llvm::ArrayRef<long>, llvm::ArrayRef<long>, int, bool) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:661:0
#14 0x00005571e9a2b585 (anonymous namespace)::createCoarseDscTiles(mlir::PatternRewriter&, mlir::Location, mlir::Value, llvm::ArrayRef<long>, bool, bool) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:735:0
#15 0x00005571e9a2fa31 (anonymous namespace)::createEltwiseKernel(mlir::linalg::LinalgOp, mlir::PatternRewriter&) /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1321:0
#16 0x00005571e9a42f80 (anonymous namespace)::ConvertNamedEltwiseToXeGPU<mlir::linalg::AddOp>::matchAndRewrite(mlir::linalg::AddOp, mlir::PatternRewriter&) const /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1494:0
#17 0x00005571e9a6590c mlir::detail::OpOrInterfaceRewritePatternBase<mlir::linalg::AddOp>::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&) const /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/IR/PatternMatch.h:332:3
#18 0x00005571ec0c6bc8 mlir::PatternApplicator::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&, llvm::function_ref<bool (mlir::Pattern const&)>, llvm::function_ref<void (mlir::Pattern const&)>, llvm::function_ref<llvm::LogicalResult (mlir::Pattern const&)>) (./bin/gc-opt+0x53c6bc8)
#19 0x00005571ec08f3de (anonymous namespace)::GreedyPatternRewriteDriver::processWorklist() GreedyPatternRewriteDriver.cpp:0:0
#20 0x00005571ec091be5 mlir::applyPatternsAndFoldGreedily(mlir::Region&, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig, bool*) (./bin/gc-opt+0x5391be5)
#21 0x00005571e9915394 mlir::applyPatternsAndFoldGreedily(mlir::Operation*, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig, bool*) /home/jovyan/llvm/llvm-install-imex-17_oct/include/mlir/Transforms/GreedyPatternRewriteDriver.h:159:37
#22 0x00005571e9a30f6e (anonymous namespace)::LinalgToXeGPU::runOnOperation() /home/jovyan/graph-compiler/lib/gc/Transforms/GPU/LinalgToXeGPU.cpp:1649:0
#23 0x00005571ec1c3479 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (./bin/gc-opt+0x54c3479)
#24 0x00005571ec1c3931 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (./bin/gc-opt+0x54c3931)
#25 0x00005571ec1c3cd6 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::'lambda'(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&)::operator()(mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::OpPMInfo&) const Pass.cpp:0:0
#26 0x00005571ec1c29a5 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) (./bin/gc-opt+0x54c29a5)
#27 0x00005571ec1c3280 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (./bin/gc-opt+0x54c3280)
#28 0x00005571ec1c3931 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (./bin/gc-opt+0x54c3931)
#29 0x00005571ec1c4995 mlir::PassManager::run(mlir::Operation*) (./bin/gc-opt+0x54c4995)
#30 0x00005571e98c5217 performActions(llvm::raw_ostream&, std::shared_ptr<llvm::SourceMgr> const&, mlir::MLIRContext*, mlir::MlirOptMainConfig const&) MlirOptMain.cpp:0:0
#31 0x00005571e98c5c2c processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::MlirOptMainConfig const&, mlir::DialectRegistry&, llvm::ThreadPoolInterface*) MlirOptMain.cpp:0:0
#32 0x00005571e98c5d8d llvm::LogicalResult llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>::callback_fn<mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&)::'lambda'(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>(long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&) MlirOptMain.cpp:0:0
#33 0x00005571ec467b1f mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, llvm::StringRef, llvm::StringRef)::'lambda'(llvm::StringRef)::operator()(llvm::StringRef) const ToolUtilities.cpp:0:0
#34 0x00005571ec468472 mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<llvm::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, llvm::StringRef, llvm::StringRef) (./bin/gc-opt+0x5768472)
#35 0x00005571e98bd56c mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, mlir::DialectRegistry&, mlir::MlirOptMainConfig const&) (./bin/gc-opt+0x2bbd56c)
#36 0x00005571e98c5ef0 mlir::MlirOptMain(int, char**, llvm::StringRef, llvm::StringRef, mlir::DialectRegistry&) (./bin/gc-opt+0x2bc5ef0)
#37 0x00005571e98c6417 mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&) (./bin/gc-opt+0x2bc6417)
#38 0x00005571e70d410c main /home/jovyan/graph-compiler/src/gc-opt/gc-opt.cpp:75:0
#39 0x00007fd97d426d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#40 0x00007fd97d426e40 call_init ./csu/../csu/libc-start.c:128:20
#41 0x00007fd97d426e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#42 0x00005571e70d3cf5 _start (./bin/gc-opt+0x3d3cf5)
Aborted (core dumped)

(1D is not supported by linalg-to-xegpu)

Signed-off-by: dchigarev <[email protected]>
@dchigarev merged commit 672edc9 into intel:main on Nov 15, 2024
6 checks passed

Successfully merging this pull request may close these issues.

Properly handle SLM memory at linalg-to-xegpu pass
3 participants