[Tensor] Simplify tensor.pad tiling length calculations. #119039
Conversation
Signed-off-by: Nirvedh <[email protected]>
@llvm/pr-subscribers-mlir-tensor @llvm/pr-subscribers-mlir

Author: Nirvedh Meshram (nirvedhmeshram)

Changes: The current calculation computes the ending location of the tiled length and then subtracts the new offset from that location. It is possible to calculate the length directly. Besides requiring fewer operations (which can matter in the dynamic case), this also has the advantage that the values are upper bounded by length rather than the source size, which is friendlier for range analysis. I believe the change is already being tested by test/Dialect/Linalg/subtensor-of-padtensor.mlir and test/Dialect/Linalg/tile-and-fuse-tensors.mlir.

Full diff: https://github.com/llvm/llvm-project/pull/119039.diff (1 file affected):
diff --git a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
index 68c3d1cabb11cb..e647e53bc33af6 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
@@ -746,11 +746,6 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
Location loc = padOp->getLoc();
AffineExpr dim0, dim1;
bindDims(b.getContext(), dim0, dim1);
- // Add two integers.
- auto addMap = AffineMap::get(2, 0, {dim0 + dim1});
- auto add = [&](OpFoldResult v1, OpFoldResult v2) {
- return affine::makeComposedFoldedAffineApply(b, loc, addMap, {v1, v2});
- };
// Subtract two integers.
auto subMap = AffineMap::get(2, 0, {dim0 - dim1});
auto sub = [&](OpFoldResult v1, OpFoldResult v2) {
@@ -825,16 +820,13 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
// The original read could also have stopped in the high padding zone.
// In that case, the end position of the read should be the end of
// the source tensor. (Similar to newOffset.)
- //
- // endLoc = min(max(offset - low + length, 0), srcSize)
- //
- // The new ExtractSliceOp length is `endLoc - newOffset`.
- //
- // Optimization: If low = 0, then the formula can be simplified.
- OpFoldResult endLoc =
- hasLowPad ? min(max(add(sub(offset, low), length), zero), srcSize)
- : min(add(offset, length), srcSize);
- OpFoldResult newLength = sub(endLoc, newOffset);
+ // srcSize - newOffset represents how much length we have available
+ // and length - newLow represents how much length we want at most.
+ OpFoldResult newLength = min(sub(srcSize, newOffset), sub(length, newLow));
+ // Optimization: If low = 0, then newLow = 0, so newLength >= 0 assuming
+ // length >= 0.
+ if (hasLowPad)
+ newLength = max(newLength, zero);
newLengths.push_back(newLength);
// Check if newLength is zero. In that case, no SubTensorOp should be
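To sanity-check that the new formula agrees with the old `endLoc`-based one, here is a small brute-force sketch in Python. It is not part of the PR; the `new_offset`/`new_low` definitions below are assumptions that mirror the clamping done earlier in `bubbleUpPadSlice`, written out for illustration only.

```python
# Brute-force check that the old and new newLength formulas agree.
# new_offset/new_low are assumed definitions mirroring the clamping done
# earlier in bubbleUpPadSlice; they are not copied from the PR.
def check(src_size, low, offset, length):
    new_offset = min(max(offset - low, 0), src_size)
    new_low = min(max(low - offset, 0), length)

    # Old: endLoc = min(max(offset - low + length, 0), srcSize);
    #      newLength = endLoc - newOffset
    end_loc = min(max(offset - low + length, 0), src_size)
    old_len = end_loc - new_offset

    # New: newLength = max(min(srcSize - newOffset, length - newLow), 0)
    new_len = max(min(src_size - new_offset, length - new_low), 0)
    return old_len, new_len

# Exhaustively compare both formulas over small nonnegative ranges.
for src_size in range(8):
    for low in range(8):
        for offset in range(16):
            for length in range(8):
                old_len, new_len = check(src_size, low, offset, length)
                assert old_len == new_len, (src_size, low, offset, length)
```

The new form also needs one fewer affine op per dimension, since the `add` helper removed by the diff is no longer used.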
This LGTM. It is a bit strange that there are no lit test changes, but I'll approve since it has been tested in a downstream project, and a lit test for all this indexing math is probably a bit of a change detector test anyway.
You can test this locally with the following command:

git-clang-format --diff f9e11501841fc602488fea78b88910eab7d4d396 865958f0c9fc63aa28971d3387d16f2e5f8d84a7 --extensions cpp -- mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp

View the diff from clang-format here:

diff --git a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
index 3caf93b140..6e63cf068b 100644
--- a/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
+++ b/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp
@@ -822,7 +822,13 @@ FailureOr<TilingResult> tensor::bubbleUpPadSlice(OpBuilder &b,
// the source tensor. (Similar to newOffset.)
// srcSize - newOffset represents how much length we have available
// and length - newLow represents how much length we want at most.
- // Note that there are many ways to order this indexing math to compute newLength, but we want to make sure that the final affine.min ops in the sequence are bounding the index to as small a value as possible. If ValueBoundsOpInterface is used, this calcuation will get upper bounds from the affine.min ops, so we want to use the smallest known value to set the bound at the end of the computation sequence. In this case, the index will be upper bounded by length - newLow.
+ // Note that there are many ways to order this indexing math to compute
+ // newLength, but we want to make sure that the final affine.min ops in the
+ // sequence are bounding the index to as small a value as possible. If
+ // ValueBoundsOpInterface is used, this calcuation will get upper bounds
+ // from the affine.min ops, so we want to use the smallest known value to
+ // set the bound at the end of the computation sequence. In this case, the
+ // index will be upper bounded by length - newLow.
OpFoldResult newLength = min(sub(srcSize, newOffset), sub(length, newLow));
// Optimization: If low = 0, then newLow = 0. then newLength >= 0 assuming
// length >= 0.
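The ordering argument in the comment above can be illustrated with a toy upper-bound propagation, in the spirit of what ValueBoundsOpInterface derives. This is an illustrative sketch only: it is not the actual MLIR API, and the concrete bound values are hypothetical.

```python
# Toy upper-bound propagation: for affine.min the derivable upper bound is
# the smaller operand's bound; for a - b it is ub(a) - lb(b).
def ub_sub(ub_a, lb_b):
    return ub_a - lb_b

def ub_min(ub_a, ub_b):
    return min(ub_a, ub_b)

# Hypothetical bounds: a large source tensor tiled into small slices.
src_size_ub, length_ub = 4096, 64
new_offset_lb, new_low_lb = 0, 0

# Old sequence ends in a subtraction (endLoc - newOffset), and endLoc is
# only known to be <= srcSize, so the derivable bound is srcSize.
end_loc_ub = src_size_ub
old_bound = ub_sub(end_loc_ub, new_offset_lb)

# New sequence ends in an affine.min, so the tighter operand wins and the
# result is bounded by length - newLow.
new_bound = ub_min(ub_sub(src_size_ub, new_offset_lb),
                   ub_sub(length_ub, new_low_lb))

print(old_bound, new_bound)
```

With these hypothetical numbers the old ordering only proves a bound of 4096, while the new ordering proves 64, which is what makes the result friendlier for downstream range analysis.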
… shapes (#19484)

This PR does two things:

1. Allow all GEMM shapes to use the padded TileAndFuse matmul configuration. This is still behind the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, the following PRs that have landed in the past month make it possible to relax the guards we originally had on this: #19196, #19307, llvm/llvm-project#117340.
2. Allow fused producers to use the padded TileAndFuse matmul configuration. The following PRs make this possible now: #19399, llvm/llvm-project#119039.

Together this allows us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we use by default. [Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0) is the performance difference observed in conv cases in iree-kernel-benchmark-module that utilize this change. A median speedup of 2.26x was observed.

The numeric changes I observed with enabling this path were the same between any aligned shape when comparing intrinsic vs no intrinsic use. Generally some differences are noticed for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute errors, we may have to change some test values to account for this change. The perf difference in CI seems to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

---------

Signed-off-by: Nirvedh <[email protected]>
The current calculation computes the ending location of the new length and then subtracts the new offset from that location. It is possible to calculate the new length directly. Besides requiring fewer operations (which can matter in the dynamic case), this also has the advantage that the values are upper bounded by length rather than the source size, which is friendlier for range analysis. I believe the change is already being tested by
test/Dialect/Linalg/subtensor-of-padtensor.mlir
and test/Dialect/Linalg/tile-and-fuse-tensors.mlir