
[Tensor] Simplify tensor.pad tiling length calculations. #119039


Merged

merged 3 commits into llvm:main on Dec 12, 2024

Conversation

nirvedhmeshram
Contributor

@nirvedhmeshram nirvedhmeshram commented Dec 6, 2024

The current calculation computes the ending location of the new length and then subtracts the new offset from that location. It is possible to calculate the new length directly. Besides requiring fewer operations (which can matter in the dynamic case), this also has the advantage that the values are upper bounded by the tile length rather than the source size, which is friendlier for range analysis. I believe the change is already covered by
test/Dialect/Linalg/subtensor-of-padtensor.mlir and test/Dialect/Linalg/tile-and-fuse-tensors.mlir
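To make the equivalence concrete, here is a small scalar model of the index math (a sketch with hypothetical helper names, not the MLIR builder code; it assumes newOffset = min(max(offset - low, 0), srcSize) and newLow = max(low - offset, 0), with all quantities non-negative):

```cpp
#include <algorithm>

// Scalar model of the pad-tiling index math (a sketch, not the MLIR code).
// offset/length describe the tile of the padded result, low is the low
// padding amount, srcSize the size of the unpadded source dimension.
int newOffsetOf(int offset, int low, int srcSize) {
  return std::min(std::max(offset - low, 0), srcSize);
}
int newLowOf(int offset, int low) { return std::max(low - offset, 0); }

// Old formulation: compute the end location, then subtract the new offset.
int lengthViaEndLoc(int offset, int low, int length, int srcSize) {
  int endLoc = std::min(std::max(offset - low + length, 0), srcSize);
  return endLoc - newOffsetOf(offset, low, srcSize);
}

// New formulation: min(available length, wanted length), clamped at zero.
int lengthDirect(int offset, int low, int length, int srcSize) {
  int n = std::min(srcSize - newOffsetOf(offset, low, srcSize),
                   length - newLowOf(offset, low));
  return std::max(n, 0);
}
```

An exhaustive check over small non-negative inputs confirms the two formulations agree; the final max with zero in lengthDirect is only load-bearing when low > 0, which is what the hasLowPad guard in the patch exploits.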

@llvmbot
Member

llvmbot commented Dec 6, 2024

@llvm/pr-subscribers-mlir-tensor

@llvm/pr-subscribers-mlir

Author: Nirvedh Meshram (nirvedhmeshram)

Changes

The current calculation computes the ending location of the tiled length and then subtracts the new offset from that location. It is possible to calculate the new length directly. Besides requiring fewer operations (which can matter in the dynamic case), this also has the advantage that the values are upper bounded by the tile length rather than the source size, which is friendlier for range analysis. I believe the change is already covered by
test/Dialect/Linalg/subtensor-of-padtensor.mlir and test/Dialect/Linalg/tile-and-fuse-tensors.mlir


Full diff: https://github.com/llvm/llvm-project/pull/119039.diff

1 file affected:

  • (modified) mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp (+7-15)
diff --git a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
index 68c3d1cabb11cb..e647e53bc33af6 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
@@ -746,11 +746,6 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
   Location loc = padOp->getLoc();
   AffineExpr dim0, dim1;
   bindDims(b.getContext(), dim0, dim1);
-  // Add two integers.
-  auto addMap = AffineMap::get(2, 0, {dim0 + dim1});
-  auto add = [&](OpFoldResult v1, OpFoldResult v2) {
-    return affine::makeComposedFoldedAffineApply(b, loc, addMap, {v1, v2});
-  };
   // Subtract two integers.
   auto subMap = AffineMap::get(2, 0, {dim0 - dim1});
   auto sub = [&](OpFoldResult v1, OpFoldResult v2) {
@@ -825,16 +820,13 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
     // The original read could also have stopped in the high padding zone.
    // In that case, the end position of the read should be the end of
     // the source tensor. (Similar to newOffset.)
-    //
-    // endLoc = min(max(offset - low + length, 0), srcSize)
-    //
-    // The new ExtractSliceOp length is `endLoc - newOffset`.
-    //
-    // Optimization: If low = 0, then the formula can be simplified.
-    OpFoldResult endLoc =
-        hasLowPad ? min(max(add(sub(offset, low), length), zero), srcSize)
-                  : min(add(offset, length), srcSize);
-    OpFoldResult newLength = sub(endLoc, newOffset);
+    // srcSize - newOffset represents how much length we have available
+    // and length - newLow represents how much length we want at most.
+    OpFoldResult newLength = min(sub(srcSize, newOffset), sub(length, newLow));
+    // Optimization: If low = 0, then newLow = 0. then newLength >= 0 assuming
+    // length >= 0.
+    if (hasLowPad)
+      newLength = max(newLength, zero);
     newLengths.push_back(newLength);
 
     // Check if newLength is zero. In that case, no SubTensorOp should be
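As a numeric illustration of the clamp in this hunk (hypothetical scalar values, modeling newOffset = min(max(offset - low, 0), srcSize) and newLow = max(low - offset, 0); a sketch, not the builder code):

```cpp
#include <algorithm>

// Pre-clamp value of the new length from the patch:
//   min(srcSize - newOffset, length - newLow)
// modeled with plain ints. The clamp max(., 0) is applied separately.
int preClampLength(int offset, int low, int length, int srcSize) {
  int newOffset = std::min(std::max(offset - low, 0), srcSize);
  int newLow = std::max(low - offset, 0);
  return std::min(srcSize - newOffset, length - newLow);
}
```

With low = 0, newLow = 0 and newOffset <= srcSize, so both operands of the min are non-negative whenever length >= 0 and the clamp folds away; that is why the code only emits the max when hasLowPad is true. With low padding the tile can fall entirely inside the padded region: offset = 0, low = 5, length = 3, srcSize = 10 gives min(10, -2) = -2 before the clamp.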

Contributor

@Max191 Max191 left a comment

This LGTM. It is a bit strange that there are no lit test changes, but I'll approve since it has been tested in a downstream project, and a lit test for all this indexing math is probably a bit of a change detector test anyway.

@nirvedhmeshram nirvedhmeshram force-pushed the simplify_pad_tiling_length branch from 1af0920 to 473c00b on December 12, 2024 17:01
Signed-off-by: Nirvedh <[email protected]>
@nirvedhmeshram nirvedhmeshram merged commit 3f136f7 into llvm:main Dec 12, 2024
3 of 6 checks passed

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff f9e11501841fc602488fea78b88910eab7d4d396 865958f0c9fc63aa28971d3387d16f2e5f8d84a7 --extensions cpp -- mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
View the diff from clang-format here.
diff --git a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
index 3caf93b140..6e63cf068b 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
@@ -822,7 +822,13 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
     // the source tensor. (Similar to newOffset.)
     // srcSize - newOffset represents how much length we have available
     // and length - newLow represents how much length we want at most.
-    // Note that there are many ways to order this indexing math to compute newLength, but we want to make sure that the final affine.min ops in the sequence are bounding the index to as small a value as possible. If ValueBoundsOpInterface is used, this calcuation will get upper bounds from the affine.min ops, so we want to use the smallest known value to set the bound at the end of the computation sequence. In this case, the index will be upper bounded by length - newLow.
+    // Note that there are many ways to order this indexing math to compute
+    // newLength, but we want to make sure that the final affine.min ops in the
+    // sequence are bounding the index to as small a value as possible. If
+    // ValueBoundsOpInterface is used, this calcuation will get upper bounds
+    // from the affine.min ops, so we want to use the smallest known value to
+    // set the bound at the end of the computation sequence. In this case, the
+    // index will be upper bounded by length - newLow.
     OpFoldResult newLength = min(sub(srcSize, newOffset), sub(length, newLow));
     // Optimization: If low = 0, then newLow = 0. then newLength >= 0 assuming
     // length >= 0.
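The ordering argument in this comment can be illustrated with a toy interval analysis (entirely hypothetical types and bounds, not MLIR's actual ValueBoundsOpInterface): the upper bound of a min is the min of the upper bounds, so finishing the computation with min(..., length - newLow) bounds the result by length, whereas the old endLoc - newOffset form only gets a bound of srcSize.

```cpp
#include <algorithm>

// Toy interval arithmetic mimicking what a range analysis could derive.
struct Interval { int lo, hi; };
Interval addI(Interval a, Interval b) { return {a.lo + b.lo, a.hi + b.hi}; }
Interval subI(Interval a, Interval b) { return {a.lo - b.hi, a.hi - b.lo}; }
Interval minI(Interval a, Interval b) {
  return {std::min(a.lo, b.lo), std::min(a.hi, b.hi)};
}
Interval maxI(Interval a, Interval b) {
  return {std::max(a.lo, b.lo), std::max(a.hi, b.hi)};
}
```

For example, with offset and srcSize in [0, 100], low in [0, 10], and length in [0, 8], the old formulation's result interval has upper bound 100 (the srcSize bound), while the new formulation's has upper bound 8 (the length bound).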

nirvedhmeshram added a commit that referenced this pull request Dec 12, 2024
Missed committing clang-format in
[#119039](#119039)
nirvedhmeshram added a commit to iree-org/iree that referenced this pull request Dec 18, 2024
… shapes (#19484)

This PR does two things
1. Allow all GEMM shapes to use padded TileAndFuse Matmul configuration.
This is still behind the
`iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default
and does not change the default behavior. However following PRs that
have landed in the past month make it possible to relax the guards we
originally had on this.
#19196
#19307
llvm/llvm-project#117340
2. Allow fused producers to use padded TileAndFuse Matmul
configuration. Following PRs make this possible now
#19399
llvm/llvm-project#119039

Together this allows us to do padded IGEMM with intrinsics for shapes
unaligned to intrinsic which we use by default.
[Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0)
is the performance difference observed in conv cases in
iree-kernel-benchmark-module that utilize this change. A median speedup
of 2.26x was observed.

The numeric changes I observed with this path enabled were the same
as between any aligned shape when comparing intrinsic vs. no-intrinsic use.
Generally some differences show up for narrow types like f16, but they
are within a relative error of 0.001; since our tests use absolute
errors, we may have to change some test values to account for this
change.

The perf differences in CI seem to be within the noise margin compared to
main:
https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

---------

Signed-off-by: Nirvedh <[email protected]>