Move python/lib/core/bfloat16.h to TSL #31
Merged
c67f00b to 3a84ae3
Part of OpenXLA refactoring

PiperOrigin-RevId: 493756462
3a84ae3 to 9309e33
copybara-service bot pushed a commit that referenced this pull request on Mar 27, 2025
Imported from GitHub PR #24114

Converting FP8 <-> FP8 fails because the Triton compiler does not support it. The proposed fix will make the conversion go through FP16.

Two questions:

1) Are there any better approaches to solving this?
2) I could not find a place to put unit tests for this, and in the code there is a comment saying:

```
// TODO(b/266862493): Add end-to-end test once FP8 support lands in XLA as
// we can't test the code below without patching the feature.
```

Wondering if there is a place where I can add a test?

### Details

When converting FP8 types, the XLA compiler emits a `fp_to_fp` Triton instruction. If the source type is FP8, no rounding strategy is specified. Concretely, this causes the following Triton to be emitted:

<details>
<summary>
<code>%24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN></code>
</summary>

```
module {
  tt.func @gemm_fusion_dot_320_impl(%arg0: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f8E5M2> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}) {
    %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf8E4M3FN>
    %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x64xf8E4M3FN>
    %c90_i32 = arith.constant 90 : i32
    %c32000_i64 = arith.constant 32000 : i64
    %c64_i32 = arith.constant 64 : i32
    %c90_i64 = arith.constant 90 : i64
    %c768_i64 = arith.constant 768 : i64
    %c0_i32 = arith.constant 0 : i32
    %c1_i64 = arith.constant 1 : i64
    %c32_i32 = arith.constant 32 : i32
    %c24_i32 = arith.constant 24 : i32
    %c8_i32 = arith.constant 8 : i32
    %c4000_i32 = arith.constant 4000 : i32
    %cst_1 = arith.constant dense<0.000000e+00> : tensor<32x64xf32>
    %0 = tt.get_program_id x : i32
    %1 = arith.divsi %0, %c4000_i32 : i32
    %2 = arith.muli %1, %c8_i32 : i32
    %3 = arith.subi %c24_i32, %2 : i32
    %4 = arith.cmpi slt, %3, %c8_i32 : i32
    %5 = arith.select %4, %3, %c8_i32 : i32
    %6 = arith.remsi %0, %5 : i32
    %7 = arith.addi %2, %6 : i32
    %8 = arith.remsi %0, %c4000_i32 : i32
    %9 = arith.divsi %8, %5 : i32
    %10 = arith.muli %7, %c32_i32 : i32
    %11 = tt.make_tensor_ptr %arg1, [%c768_i64, %c90_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E5M2>>
    %12 = tt.advance %11, [%10, %c0_i32] : <tensor<32x64xf8E5M2>>
    %13 = arith.muli %9, %c64_i32 : i32
    %14 = tt.make_tensor_ptr %arg0, [%c90_i64, %c32000_i64], [%c1_i64, %c90_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x64xf8E4M3FN>>
    %15 = tt.advance %14, [%c0_i32, %13] : <tensor<64x64xf8E4M3FN>>
    %16:3 = scf.for %arg3 = %c0_i32 to %c90_i32 step %c64_i32 iter_args(%arg4 = %12, %arg5 = %15, %arg6 = %cst_1) -> (!tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>) : i32 {
      %20 = tt.load %arg4 {boundaryCheck = array<i32: 1>, padding = 1 : i32} : !tt.ptr<tensor<32x64xf8E5M2>>
      %21 = tt.advance %arg4, [%c0_i32, %c64_i32] : <tensor<32x64xf8E5M2>>
      %22 = tt.load %arg5 {boundaryCheck = array<i32: 0>, padding = 1 : i32} : !tt.ptr<tensor<64x64xf8E4M3FN>>
      %23 = tt.advance %arg5, [%c64_i32, %c0_i32] : <tensor<64x64xf8E4M3FN>>
      %24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN>
      %25 = arith.subi %c90_i32, %arg3 : i32
      %26 = arith.cmpi slt, %25, %c64_i32 : i32
      %27 = scf.if %26 -> (tensor<32x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %32 = tt.splat %25 : i32 -> tensor<1x64xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<1x64xi32>
        %34 = tt.broadcast %33 : tensor<1x64xi1> -> tensor<32x64xi1>
        %35 = arith.select %34, %24, %cst_0 : tensor<32x64xi1>, tensor<32x64xf8E4M3FN>
        scf.yield %35 : tensor<32x64xf8E4M3FN>
      } else {
        scf.yield %24 : tensor<32x64xf8E4M3FN>
      }
      %28 = scf.if %26 -> (tensor<64x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %32 = tt.splat %25 : i32 -> tensor<64x1xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<64x1xi32>
        %34 = tt.broadcast %33 : tensor<64x1xi1> -> tensor<64x64xi1>
        %35 = arith.select %34, %22, %cst : tensor<64x64xi1>, tensor<64x64xf8E4M3FN>
        scf.yield %35 : tensor<64x64xf8E4M3FN>
      } else {
        scf.yield %22 : tensor<64x64xf8E4M3FN>
      }
      %29 = tt.dot %27, %28, %arg6, inputPrecision = tf32 {maxNumImpreciseAcc = 2147483647 : i32} : tensor<32x64xf8E4M3FN> * tensor<64x64xf8E4M3FN> -> tensor<32x64xf32>
      scf.yield %21, %23, %29 : !tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>
    }
    %17 = tt.fp_to_fp %16#2, rounding = rtne : tensor<32x64xf32> -> tensor<32x64xf8E4M3FN>
    %18 = tt.make_tensor_ptr %arg2, [%c768_i64, %c32000_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E4M3FN>>
    %19 = tt.advance %18, [%10, %13] : <tensor<32x64xf8E4M3FN>>
    tt.store %19, %17 : !tt.ptr<tensor<32x64xf8E4M3FN>>
    tt.return
  }
}
```

</details>

Which leads to a failing assertion:

```
#0 0x000073413786d9fc in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000734137819476 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007341377ff7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007341377ff71b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000734137810e96 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x000057d936b1777b in mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion::createDestOps (this=0x733d08425cc0, op=..., adaptor=..., rewriter=..., elemTy=..., operands=..., loc=...)
    at external/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ElementwiseOpToLLVM.cpp:500
#6 0x000057d936b17195 in mlir::triton::gpu::ElementwiseOpConversionBase<mlir::triton::FpToFpOp, mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion>::matchAndRewrite (this=0x733d08425cc0, op=..., adaptor=..., rewriter=...)
    at external/triton/include/triton/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVMBase.h:188
[...]
#29 0x000057d93fa6cade in mlir::PassManager::run (this=0x733e80fba158, op=0x733d080bbc20)
    at external/llvm-project/mlir/lib/Pass/Pass.cpp:885
#30 0x000057d9363f6b1b in xla::gpu::CompileTritonToLLVM (hlo_config=..., hlo_module_name="gemm_fusion_dot.320", device_info=..., block_level_parameters=..., triton_module=..., llvm_module=0x733d0816d6a0, mlir_context=..., is_xla_fusion=true, emit_kernel=true)
    at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1627
#31 0x000057d9363f5a5d in xla::gpu::TritonWrapper (fn_name="gemm_fusion_dot_320_impl", fusion=0x733d080a31c0, cc=std::variant<stream_executor::CudaComputeCapability, stream_executor::RocmComputeCapability> [index 0] = {...}, device_info=..., block_level_parameters=..., llvm_module=0x733d0816d6a0, mlir_context=...)
    at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1531
```

Triton compilation of this module fails in two stages:

* First, it hits an assertion that a rounding strategy must be specified when the destination type is FP8.
* After adding the rounding strategy, it hits another issue: no methods for converting FP8 <-> FP8 are specified.

To work around the above two issues, I propose going through FP16 when both the source and destination types are FP8.

Copybara import of the project:

--
afd3929 by Kasper Nielsen <[email protected]>:

Fix fused fp8 <-> fp8 conversions

--
66340aa by Kasper Nielsen <[email protected]>:

Add unit tests and refactor duplicated code

--
07ae307 by Kasper Nielsen <[email protected]>:

Run clang-format

Merging this change closes #24114

FUTURE_COPYBARA_INTEGRATE_REVIEW=#24114 from kasper0406:kn/fp8-conversion-fix 07ae307
PiperOrigin-RevId: 741162069
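The workaround in the commit message above (routing FP8 <-> FP8 through FP16) can be sketched as a small planning function. This is only an illustration of the decision logic, not the code from the PR: `ElemType`, `ConvertStep`, `IsFp8` and `PlanFpConversion` are hypothetical names, and the use of `rtne` for every conversion into FP8 is an assumption based on the `f32 -> f8E4M3FN` conversion in the dump.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical element types for this sketch; not XLA's or Triton's enums.
enum class ElemType { kF8E5M2, kF8E4M3FN, kF16, kF32 };

// Hypothetical description of one tt.fp_to_fp step.
struct ConvertStep {
  ElemType from;
  ElemType to;
  std::string rounding;  // e.g. "rtne"; empty when no rounding mode is set.
};

bool IsFp8(ElemType t) {
  return t == ElemType::kF8E5M2 || t == ElemType::kF8E4M3FN;
}

// Plans the conversion steps for src -> dst.
// Assumption: "rtne" is attached whenever the destination is FP8.
// Conversions between two non-FP8 types are not modeled here.
std::vector<ConvertStep> PlanFpConversion(ElemType src, ElemType dst) {
  assert(src != dst && "identical types need no conversion");
  std::vector<ConvertStep> steps;
  if (IsFp8(src) && IsFp8(dst)) {
    // FP8 -> FP16 is a widening step (both FP8 formats fit in FP16),
    // so only the FP16 -> FP8 step carries a rounding mode.
    steps.push_back(ConvertStep{src, ElemType::kF16, ""});
    steps.push_back(ConvertStep{ElemType::kF16, dst, "rtne"});
  } else if (IsFp8(dst)) {
    // Narrowing into FP8: the rounding mode must be specified.
    steps.push_back(ConvertStep{src, dst, "rtne"});
  } else {
    // FP8 source with a wider destination: no rounding mode is set.
    steps.push_back(ConvertStep{src, dst, ""});
  }
  return steps;
}
```

Calling `PlanFpConversion(ElemType::kF8E5M2, ElemType::kF8E4M3FN)` yields two steps, which corresponds to replacing the single failing `tt.fp_to_fp` with a pair of conversions. The intermediate type is fixed to FP16 here because that is what the commit message proposes.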
copybara-service bot
pushed a commit
that referenced
this pull request
Mar 27, 2025
Imported from GitHub PR #24114

Converting FP8 <-> FP8 fails because the Triton compiler does not support it. The proposed fix will make the conversion go through FP16.

Two questions:
1) Are there any better approaches to solving this?
2) I could not find a place to put unit tests for this, and in the code there is a comment saying:
```
// TODO(b/266862493): Add end-to-end test once FP8 support lands in XLA as
// we can't test the code below without patching the feature.
```
Wondering if there is a place where I can add a test?

### Details

When converting FP8 types, the XLA compiler emits a `fp_to_fp` Triton instruction. If the source type is FP8, no rounding strategy is specified.

Concretely, this causes the following Triton to be emitted:

<details>
<summary>
<code>%24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN></code>
</summary>

```
module {
  tt.func @gemm_fusion_dot_320_impl(%arg0: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f8E5M2> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f8E4M3FN> {tt.divisibility = 16 : i32}) {
    %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf8E4M3FN>
    %cst_0 = arith.constant dense<0.000000e+00> : tensor<32x64xf8E4M3FN>
    %c90_i32 = arith.constant 90 : i32
    %c32000_i64 = arith.constant 32000 : i64
    %c64_i32 = arith.constant 64 : i32
    %c90_i64 = arith.constant 90 : i64
    %c768_i64 = arith.constant 768 : i64
    %c0_i32 = arith.constant 0 : i32
    %c1_i64 = arith.constant 1 : i64
    %c32_i32 = arith.constant 32 : i32
    %c24_i32 = arith.constant 24 : i32
    %c8_i32 = arith.constant 8 : i32
    %c4000_i32 = arith.constant 4000 : i32
    %cst_1 = arith.constant dense<0.000000e+00> : tensor<32x64xf32>
    %0 = tt.get_program_id x : i32
    %1 = arith.divsi %0, %c4000_i32 : i32
    %2 = arith.muli %1, %c8_i32 : i32
    %3 = arith.subi %c24_i32, %2 : i32
    %4 = arith.cmpi slt, %3, %c8_i32 : i32
    %5 = arith.select %4, %3, %c8_i32 : i32
    %6 = arith.remsi %0, %5 : i32
    %7 = arith.addi %2, %6 : i32
    %8 = arith.remsi %0, %c4000_i32 : i32
    %9 = arith.divsi %8, %5 : i32
    %10 = arith.muli %7, %c32_i32 : i32
    %11 = tt.make_tensor_ptr %arg1, [%c768_i64, %c90_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E5M2>>
    %12 = tt.advance %11, [%10, %c0_i32] : <tensor<32x64xf8E5M2>>
    %13 = arith.muli %9, %c64_i32 : i32
    %14 = tt.make_tensor_ptr %arg0, [%c90_i64, %c32000_i64], [%c1_i64, %c90_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x64xf8E4M3FN>>
    %15 = tt.advance %14, [%c0_i32, %13] : <tensor<64x64xf8E4M3FN>>
    %16:3 = scf.for %arg3 = %c0_i32 to %c90_i32 step %c64_i32 iter_args(%arg4 = %12, %arg5 = %15, %arg6 = %cst_1) -> (!tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>) : i32 {
      %20 = tt.load %arg4 {boundaryCheck = array<i32: 1>, padding = 1 : i32} : !tt.ptr<tensor<32x64xf8E5M2>>
      %21 = tt.advance %arg4, [%c0_i32, %c64_i32] : <tensor<32x64xf8E5M2>>
      %22 = tt.load %arg5 {boundaryCheck = array<i32: 0>, padding = 1 : i32} : !tt.ptr<tensor<64x64xf8E4M3FN>>
      %23 = tt.advance %arg5, [%c64_i32, %c0_i32] : <tensor<64x64xf8E4M3FN>>
      %24 = tt.fp_to_fp %20 : tensor<32x64xf8E5M2> -> tensor<32x64xf8E4M3FN>
      %25 = arith.subi %c90_i32, %arg3 : i32
      %26 = arith.cmpi slt, %25, %c64_i32 : i32
      %27 = scf.if %26 -> (tensor<32x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
        %32 = tt.splat %25 : i32 -> tensor<1x64xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<1x64xi32>
        %34 = tt.broadcast %33 : tensor<1x64xi1> -> tensor<32x64xi1>
        %35 = arith.select %34, %24, %cst_0 : tensor<32x64xi1>, tensor<32x64xf8E4M3FN>
        scf.yield %35 : tensor<32x64xf8E4M3FN>
      } else {
        scf.yield %24 : tensor<32x64xf8E4M3FN>
      }
      %28 = scf.if %26 -> (tensor<64x64xf8E4M3FN>) {
        %30 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
        %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
        %32 = tt.splat %25 : i32 -> tensor<64x1xi32>
        %33 = arith.cmpi slt, %31, %32 : tensor<64x1xi32>
        %34 = tt.broadcast %33 : tensor<64x1xi1> -> tensor<64x64xi1>
        %35 = arith.select %34, %22, %cst : tensor<64x64xi1>, tensor<64x64xf8E4M3FN>
        scf.yield %35 : tensor<64x64xf8E4M3FN>
      } else {
        scf.yield %22 : tensor<64x64xf8E4M3FN>
      }
      %29 = tt.dot %27, %28, %arg6, inputPrecision = tf32 {maxNumImpreciseAcc = 2147483647 : i32} : tensor<32x64xf8E4M3FN> * tensor<64x64xf8E4M3FN> -> tensor<32x64xf32>
      scf.yield %21, %23, %29 : !tt.ptr<tensor<32x64xf8E5M2>>, !tt.ptr<tensor<64x64xf8E4M3FN>>, tensor<32x64xf32>
    }
    %17 = tt.fp_to_fp %16#2, rounding = rtne : tensor<32x64xf32> -> tensor<32x64xf8E4M3FN>
    %18 = tt.make_tensor_ptr %arg2, [%c768_i64, %c32000_i64], [%c1_i64, %c768_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<32x64xf8E4M3FN>>
    %19 = tt.advance %18, [%10, %13] : <tensor<32x64xf8E4M3FN>>
    tt.store %19, %17 : !tt.ptr<tensor<32x64xf8E4M3FN>>
    tt.return
  }
}
```
</details>

Which leads to a failing assertion:
```
#0 0x000073413786d9fc in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000734137819476 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007341377ff7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007341377ff71b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000734137810e96 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x000057d936b1777b in mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion::createDestOps (this=0x733d08425cc0, op=..., adaptor=..., rewriter=..., elemTy=..., operands=..., loc=...) at external/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/ElementwiseOpToLLVM.cpp:500
#6 0x000057d936b17195 in mlir::triton::gpu::ElementwiseOpConversionBase<mlir::triton::FpToFpOp, mlir::triton::gpu::(anonymous namespace)::FpToFpOpConversion>::matchAndRewrite (this=0x733d08425cc0, op=..., adaptor=..., rewriter=...) at external/triton/include/triton/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVMBase.h:188
[...]
#29 0x000057d93fa6cade in mlir::PassManager::run (this=0x733e80fba158, op=0x733d080bbc20) at external/llvm-project/mlir/lib/Pass/Pass.cpp:885
#30 0x000057d9363f6b1b in xla::gpu::CompileTritonToLLVM (hlo_config=..., hlo_module_name="gemm_fusion_dot.320", device_info=..., block_level_parameters=..., triton_module=..., llvm_module=0x733d0816d6a0, mlir_context=..., is_xla_fusion=true, emit_kernel=true) at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1627
#31 0x000057d9363f5a5d in xla::gpu::TritonWrapper (fn_name="gemm_fusion_dot_320_impl", fusion=0x733d080a31c0, cc=std::variant<stream_executor::CudaComputeCapability, stream_executor::RocmComputeCapability> [index 0] = {...}, device_info=..., block_level_parameters=..., llvm_module=0x733d0816d6a0, mlir_context=...) at xla/backends/gpu/codegen/triton/fusion_emitter.cc:1531
```

However, this fails Triton compilation:
* First, it hits an assertion that the rounding strategy must be specified when the destination type is FP8.
* Adding the rounding strategy then exposes another issue: no methods for converting FP8 <-> FP8 are specified.

To work around the above two issues, I propose going through FP16 when both the source and destination types are FP8.

Copybara import of the project:

--
afd3929 by Kasper Nielsen <[email protected]>:

Fix fused fp8 <-> fp8 conversions

--
66340aa by Kasper Nielsen <[email protected]>:

Add unit tests and refactor duplicated code

--
07ae307 by Kasper Nielsen <[email protected]>:

Run clang-format

Merging this change closes #24114

FUTURE_COPYBARA_INTEGRATE_REVIEW=#24114 from kasper0406:kn/fp8-conversion-fix 07ae307
PiperOrigin-RevId: 741162069
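The workaround described above (route FP8 -> FP8 through FP16, and attach a rounding mode whenever the destination is FP8) can be illustrated with a small planning sketch. This is not the XLA emitter code: `Step`, `plan_conversion`, and `render` are hypothetical names made up for this illustration, and the sketch only covers conversions where at least one side is FP8. It relies on the fact that both `f8E5M2` and `f8E4M3FN` values are exactly representable in `f16`, so the widening hop needs no rounding mode.

```python
from dataclasses import dataclass
from typing import List, Optional

# Bit widths of the element types modelled here (MLIR-style spellings).
BIT_WIDTH = {"f8E5M2": 8, "f8E4M3FN": 8, "f16": 16, "f32": 32}


@dataclass(frozen=True)
class Step:
    """One hypothetical fp_to_fp-style conversion step."""
    src: str
    dst: str
    rounding: Optional[str]  # "rtne" when rounding is required, else None


def is_fp8(ty: str) -> bool:
    return BIT_WIDTH[ty] == 8


def plan_conversion(src: str, dst: str) -> List[Step]:
    """Return the conversion steps for src -> dst (illustrative only)."""
    if src == dst:
        return []
    if not (is_fp8(src) or is_fp8(dst)):
        # Assumption: non-FP8 conversions are out of scope for this sketch.
        raise ValueError("sketch only models conversions involving FP8")
    if is_fp8(src) and is_fp8(dst):
        # FP8 -> FP8: widen to f16 (exact, no rounding), then round to dst.
        return [Step(src, "f16", None), Step("f16", dst, "rtne")]
    # A rounding mode is attached whenever the destination is FP8.
    return [Step(src, dst, "rtne" if is_fp8(dst) else None)]


def render(step: Step, operand: str = "%x", shape: str = "32x64") -> str:
    """Format a step in the same textual shape as the IR quoted above."""
    rounding = f", rounding = {step.rounding}" if step.rounding else ""
    return (f"tt.fp_to_fp {operand}{rounding} : "
            f"tensor<{shape}x{step.src}> -> tensor<{shape}x{step.dst}>")


if __name__ == "__main__":
    for step in plan_conversion("f8E5M2", "f8E4M3FN"):
        print(render(step))
```

Run as a script, this prints two steps for the `f8E5M2` -> `f8E4M3FN` case instead of the single unsupported `tt.fp_to_fp` shown in the emitted module; only the second step carries `rounding = rtne`.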
copybara-service bot pushed a commit that referenced this pull request on Mar 28, 2025
Imported from GitHub PR #24114

Converting FP8 <-> FP8 fails because the Triton compiler does not support it. The proposed fix will make the conversion go through FP16.

[...]

Copybara import of the project:

--
afd3929 by Kasper Nielsen <[email protected]>:

Fix fused fp8 <-> fp8 conversions

--
66340aa by Kasper Nielsen <[email protected]>:

Add unit tests and refactor duplicated code

--
07ae307 by Kasper Nielsen <[email protected]>:

Run clang-format

--
fe967ff by Kasper Nielsen <[email protected]>:

Fix support conversion tests

Merging this change closes #24114

FUTURE_COPYBARA_INTEGRATE_REVIEW=#24114 from kasper0406:kn/fp8-conversion-fix fe967ff
PiperOrigin-RevId: 741162069
copybara-service bot pushed a commit that referenced this pull request on Mar 28, 2025
Imported from GitHub PR #24114

Converting FP8 <-> FP8 fails because the Triton compiler does not support it. The proposed fix will make the conversion go through FP16.

[...]

Copybara import of the project:

--
afd3929 by Kasper Nielsen <[email protected]>:

Fix fused fp8 <-> fp8 conversions

--
66340aa by Kasper Nielsen <[email protected]>:

Add unit tests and refactor duplicated code

--
07ae307 by Kasper Nielsen <[email protected]>:

Run clang-format

--
fe967ff by Kasper Nielsen <[email protected]>:

Fix support conversion tests

Merging this change closes #24114

COPYBARA_INTEGRATE_REVIEW=#24114 from kasper0406:kn/fp8-conversion-fix fe967ff
PiperOrigin-RevId: 741473648
copybara-service bot pushed a commit that referenced this pull request on Apr 9, 2025
… kernel rocm cu Imported from GitHub PR #24898 Fix issue reported by asan while running the tests on rocm ci: ``` ==1718600==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5030001d97f8 at pc 0x5647cfdda211 bp 0x7ffc9eb7eac0 sp 0x7ffc9eb7eab8 READ of size 8 at 0x5030001d97f8 thread T0 #0 0x5647cfdda210 in absl::lts_20230802::container_internal::CommonFields::capacity() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 #1 0x5647cfdda210 in void absl::lts_20230802::container_internal::InitializeSlots<std::allocator<char>, 8ul, 8ul>(absl::lts_20230802::container_internal::CommonFields&, std::allocator<char>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1403:24 #2 0x7f066c2cfdde in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::resize(unsigned long) 
(/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9dde) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #3 0x7f066c2cfd97 in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::prepare_insert(unsigned long) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9d97) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #4 0x7f066c2cfcca in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned 
char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::find_or_prepare_insert<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const&) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9cca) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #5 0x7f066c2cf9c4 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::EmplaceDecomposable::operator()<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char>>, void*> const&, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>&&, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>&&) const (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x99c4) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #6 0x7f066c2cf0ad in stream_executor::GetComparisonKernel(stream_executor::StreamExecutor*, stream_executor::GpuAsmOpts) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x90ad) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #7 0x7f066c37ba93 in stream_executor::RedzoneAllocator::CheckRedzones() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/gpu/redzone_allocator.cc:272:3 #8 0x7f06b31bb7e9 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:328:7 #9 0x7f06b31bb7e9 in xla::gpu::(anonymous 
namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #10 0x7f06b31bb7e9 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #11 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #12 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #13 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) 
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #14 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #15 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #16 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #17 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #18 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #19 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #20 
0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #21 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #22 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #23 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #24 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 
#25 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #26 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #27 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #28 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #29 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #30 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #31 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #32 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #33 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #34 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #35 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #36 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #37 0x7f06c2c9579d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #38 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #39 0x7f06c2c95203 in testing::UnitTest::Run() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #40 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #41 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #42 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #43 0x7f064c0b3e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #44 0x5647cfc7b044 in _start (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0xff044) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) 0x5030001d97f8 is located 8 bytes before 32-byte region [0x5030001d9800,0x5030001d9820) allocated by thread T0 here: #0 0x5647cfd1527f in malloc (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0x19927f) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) #1 0x7f064c39798b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2) #2 0x7f06b31bb5b7 in google::protobuf::Duration* google::protobuf::MessageLite::CreateMaybeMessage<google::protobuf::Duration>(google::protobuf::Arena*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_protobuf/src/google/protobuf/message_lite.h:425:12 #3 0x7f06b31bb5b7 in xla::AutotuneResult::_internal_mutable_run_time() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3079:15 #4 0x7f06b31bb5b7 in xla::AutotuneResult::mutable_run_time() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3085:45 #5 0x7f06b31bb5b7 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:321:15 #6 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #7 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #8 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #9 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, 
xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #10 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #11 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #12 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #13 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #14 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #15 0x7f06b31b336e in 
xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #16 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #17 0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #18 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #19 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, 
absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #20 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #21 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 #22 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #23 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #24 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #25 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #26 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #27 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #28 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #29 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #30 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #31 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #32 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #33 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #34 0x7f06c2c9579d in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #35 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #36 0x7f06c2c95203 in testing::UnitTest::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #37 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #38 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #39 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: heap-buffer-overflow /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 in absl::lts_20230802::container_internal::CommonFields::capacity() const Shadow bytes around the buggy address: 0x5030001d9500: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9580: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x5030001d9600: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x5030001d9680: fd fd fd fd fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9700: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd =>0x5030001d9780: fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa[fa] 
  0x5030001d9800: 00 00 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd
  0x5030001d9880: fa fa fd fd fd fd fa fa fd fd fd fa fa fa fd fd
  0x5030001d9900: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x5030001d9980: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x5030001d9a00: fa fa fd fd fd fa fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1718600==ABORTING
```
Copybara import of the project:

--
9a75d26 by alekstheod <[email protected]>:

Fix access memory asan issue in redzone_allocator_kernel_rocm.cu

Merging this change closes #24898

FUTURE_COPYBARA_INTEGRATE_REVIEW=#24898 from ROCm:ci_fix_asan_invalid_memory_access_in_redzone_allocator_kernel_rocm_cu 9a75d26
PiperOrigin-RevId: 745536108
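For anyone reading the shadow dump in the report above, here is a minimal sketch of how the bracketed byte lines up with the faulting address. It assumes the row labels are application addresses (the 0x80 stride between rows is consistent with 16 shadow bytes of 8 application bytes each). The helper name `shadow_position` is hypothetical and is not part of ASan or XLA; the constants are copied from the report.

```python
# Sketch only: map a faulting address to its column in one row of the
# ASan shadow dump. Assumes row labels are application addresses.
SHADOW_GRANULARITY = 8   # "one shadow byte represents 8 application bytes"
BYTES_PER_ROW = 16       # shadow bytes printed per row in the report


def shadow_position(fault_addr: int, row_base: int) -> int:
    """Return the 0-based column of the shadow byte covering fault_addr."""
    offset = fault_addr - row_base
    if not 0 <= offset < SHADOW_GRANULARITY * BYTES_PER_ROW:
        raise ValueError("address is not covered by this row")
    return offset // SHADOW_GRANULARITY


# Values copied from the report.
FAULT_ADDR = 0x5030001D97F8    # "READ of size 8 at 0x5030001d97f8"
ROW_BASE = 0x5030001D9780      # the row marked with "=>"
REGION_START = 0x5030001D9800  # start of the 32-byte region
ROW = "fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa fa".split()

column = shadow_position(FAULT_ADDR, ROW_BASE)
distance_before_region = REGION_START - FAULT_ADDR

print(column, ROW[column])     # last column of the row, value "fa"
print(distance_before_region)  # bytes between the read and the region start
```

The result is column 15 with value `fa` (heap left redzone), and the read sits 8 bytes before the region, which agrees with the "located 8 bytes before 32-byte region" line in the report.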
copybara-service bot pushed a commit that referenced this pull request on Apr 9, 2025
… kernel rocm cu Imported from GitHub PR #24898 Fix issue reported by asan while running the tests on rocm ci: ``` ==1718600==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5030001d97f8 at pc 0x5647cfdda211 bp 0x7ffc9eb7eac0 sp 0x7ffc9eb7eab8 READ of size 8 at 0x5030001d97f8 thread T0 #0 0x5647cfdda210 in absl::lts_20230802::container_internal::CommonFields::capacity() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 #1 0x5647cfdda210 in void absl::lts_20230802::container_internal::InitializeSlots<std::allocator<char>, 8ul, 8ul>(absl::lts_20230802::container_internal::CommonFields&, std::allocator<char>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1403:24 #2 0x7f066c2cfdde in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::resize(unsigned long) 
(/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9dde) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #3 0x7f066c2cfd97 in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::prepare_insert(unsigned long) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9d97) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #4 0x7f066c2cfcca in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned 
char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::find_or_prepare_insert<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const&) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9cca) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #5 0x7f066c2cf9c4 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::EmplaceDecomposable::operator()<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char>>, void*> const&, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>&&, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>&&) const (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x99c4) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #6 0x7f066c2cf0ad in stream_executor::GetComparisonKernel(stream_executor::StreamExecutor*, stream_executor::GpuAsmOpts) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x90ad) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #7 0x7f066c37ba93 in stream_executor::RedzoneAllocator::CheckRedzones() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/gpu/redzone_allocator.cc:272:3 #8 0x7f06b31bb7e9 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:328:7 #9 0x7f06b31bb7e9 in xla::gpu::(anonymous 
namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #10 0x7f06b31bb7e9 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #11 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #12 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #13 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) 
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #14 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #15 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #16 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #17 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #18 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #19 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #20 
0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #21 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #22 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #23 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #24 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 
#25 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #26 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #27 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #28 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #29 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #30 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #31 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #32 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #33 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #34 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #35 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #36 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #37 0x7f06c2c9579d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #38 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #39 0x7f06c2c95203 in testing::UnitTest::Run() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #40 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #41 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #42 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #43 0x7f064c0b3e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #44 0x5647cfc7b044 in _start (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0xff044) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) 0x5030001d97f8 is located 8 bytes before 32-byte region [0x5030001d9800,0x5030001d9820) allocated by thread T0 here: #0 0x5647cfd1527f in malloc (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0x19927f) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) #1 0x7f064c39798b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2) #2 0x7f06b31bb5b7 in google::protobuf::Duration* google::protobuf::MessageLite::CreateMaybeMessage<google::protobuf::Duration>(google::protobuf::Arena*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_protobuf/src/google/protobuf/message_lite.h:425:12 #3 0x7f06b31bb5b7 in xla::AutotuneResult::_internal_mutable_run_time() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3079:15 #4 0x7f06b31bb5b7 in xla::AutotuneResult::mutable_run_time() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3085:45 #5 0x7f06b31bb5b7 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:321:15 #6 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #7 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #8 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #9 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, 
xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #10 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #11 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #12 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #13 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #14 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #15 0x7f06b31b336e in 
xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #16 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #17 0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #18 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #19 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, 
absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #20 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #21 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 #22 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #23 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #24 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #25 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #26 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #27 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #28 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #29 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #30 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #31 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #32 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #33 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #34 0x7f06c2c9579d in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #35 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #36 0x7f06c2c95203 in testing::UnitTest::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #37 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #38 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #39 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: heap-buffer-overflow /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 in absl::lts_20230802::container_internal::CommonFields::capacity() const Shadow bytes around the buggy address: 0x5030001d9500: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9580: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x5030001d9600: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x5030001d9680: fd fd fd fd fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9700: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd =>0x5030001d9780: fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa[fa] 
0x5030001d9800: 00 00 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd 0x5030001d9880: fa fa fd fd fd fd fa fa fd fd fd fa fa fa fd fd 0x5030001d9900: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa 0x5030001d9980: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9a00: fa fa fd fd fd fa fa fa fd fd fd fd fa fa fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==1718600==ABORTING ``` Copybara import of the project: -- 9a75d26 by alekstheod <[email protected]>: Fix access memory asan issue in redzone_allocator_kernel_rocm.cu Merging this change closes #24898 FUTURE_COPYBARA_INTEGRATE_REVIEW=#24898 from ROCm:ci_fix_asan_invalid_memory_access_in_redzone_allocator_kernel_rocm_cu 9a75d26 PiperOrigin-RevId: 745536108
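As a reading aid for the shadow-byte dump in the ASan report above: the legend states that one shadow byte represents 8 application bytes, and each printed row lists 16 shadow bytes, so a row covers 128 application bytes. The sketch below is illustrative arithmetic only. The helper name is hypothetical and is not part of AddressSanitizer or XLA. It locates the faulting address within the row marked `=>` and checks its distance to the 32-byte region named in the report.

```python
# Illustrative arithmetic only; this helper is hypothetical and is not part of
# AddressSanitizer or XLA. All addresses are copied from the report above.
SHADOW_GRANULARITY = 8     # "one shadow byte represents 8 application bytes"
SHADOW_BYTES_PER_ROW = 16  # each printed row lists 16 shadow bytes


def shadow_index_in_row(row_start: int, address: int) -> int:
    """Return the 0-based position of the shadow byte for `address` in a row."""
    offset = address - row_start
    if not 0 <= offset < SHADOW_GRANULARITY * SHADOW_BYTES_PER_ROW:
        raise ValueError("address is not covered by this row")
    return offset // SHADOW_GRANULARITY


fault_address = 0x5030001D97F8
row_start = 0x5030001D9780
region_start, region_end = 0x5030001D9800, 0x5030001D9820

# Shadow bytes of the row marked "=>" in the report (brackets removed).
row = "fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa fa".split()

index = shadow_index_in_row(row_start, fault_address)
bytes_before_region = region_start - fault_address
region_size = region_end - region_start

print(index, row[index], bytes_before_region, region_size)  # 15 fa 8 32
```

Index 15 with shadow value `fa` (heap left redzone in the legend) matches the bracketed byte in the report, and the computed distances agree with "located 8 bytes before 32-byte region".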
copybara-service bot pushed a commit that referenced this pull request on Apr 9, 2025
… kernel rocm cu Imported from GitHub PR #24898 Fix issue reported by asan while running the tests on rocm ci: ``` ==1718600==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5030001d97f8 at pc 0x5647cfdda211 bp 0x7ffc9eb7eac0 sp 0x7ffc9eb7eab8 READ of size 8 at 0x5030001d97f8 thread T0 #0 0x5647cfdda210 in absl::lts_20230802::container_internal::CommonFields::capacity() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 #1 0x5647cfdda210 in void absl::lts_20230802::container_internal::InitializeSlots<std::allocator<char>, 8ul, 8ul>(absl::lts_20230802::container_internal::CommonFields&, std::allocator<char>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1403:24 #2 0x7f066c2cfdde in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::resize(unsigned long) 
(/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9dde) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #3 0x7f066c2cfd97 in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::prepare_insert(unsigned long) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9d97) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #4 0x7f066c2cfcca in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned 
char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::find_or_prepare_insert<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const&) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x9cca) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #5 0x7f066c2cf9c4 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::NodeHashMapPolicy<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>, absl::lts_20230802::hash_internal::Hash<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::equal_to<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>>, std::allocator<std::pair<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*> const, stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>>>>::EmplaceDecomposable::operator()<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>>(std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char>>, void*> const&, std::piecewise_construct_t const&, std::tuple<std::tuple<stream_executor::StreamExecutor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, void*>&>&&, std::tuple<stream_executor::TypedKernel<stream_executor::DeviceMemory<unsigned char>, unsigned char, unsigned long, stream_executor::DeviceMemory<unsigned long>>&&>&&) const (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x99c4) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #6 0x7f066c2cf0ad in stream_executor::GetComparisonKernel(stream_executor::StreamExecutor*, stream_executor::GpuAsmOpts) (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/../../../_solib_local/libxla_Sstream_Uexecutor_Sgpu_Slibredzone_Uallocator_Ukernel_Urocm_Urocm.so+0x90ad) (BuildId: 3bd12bfb947fb25a2a780cc09bea1d9c) #7 0x7f066c37ba93 in stream_executor::RedzoneAllocator::CheckRedzones() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/stream_executor/gpu/redzone_allocator.cc:272:3 #8 0x7f06b31bb7e9 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:328:7 #9 0x7f06b31bb7e9 in xla::gpu::(anonymous 
namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #10 0x7f06b31bb7e9 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #11 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #12 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #13 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) 
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #14 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #15 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #16 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #17 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #18 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #19 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #20 
0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #21 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #22 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #23 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #24 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 
#25 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #26 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #27 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #28 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #29 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #30 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #31 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #32 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #33 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #34 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #35 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #36 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #37 0x7f06c2c9579d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #38 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #39 0x7f06c2c95203 in testing::UnitTest::Run() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #40 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #41 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #42 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #43 0x7f064c0b3e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #44 0x5647cfc7b044 in _start (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0xff044) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) 0x5030001d97f8 is located 8 bytes before 32-byte region [0x5030001d9800,0x5030001d9820) allocated by thread T0 here: #0 0x5647cfd1527f in malloc (/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/service/gpu/gpu_compiler_test_gpu_amd_any+0x19927f) (BuildId: ef1ac485eb61840d0e2233a2cca69eec) #1 0x7f064c39798b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2) #2 0x7f06b31bb5b7 in google::protobuf::Duration* google::protobuf::MessageLite::CreateMaybeMessage<google::protobuf::Duration>(google::protobuf::Arena*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_protobuf/src/google/protobuf/message_lite.h:425:12 #3 0x7f06b31bb5b7 in xla::AutotuneResult::_internal_mutable_run_time() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3079:15 #4 0x7f06b31bb5b7 in xla::AutotuneResult::mutable_run_time() 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/bazel-out/k8-opt/bin/xla/autotuning.pb.h:3085:45 #5 0x7f06b31bb5b7 in absl::lts_20230802::StatusOr<xla::AutotuneResult> xla::gpu::(anonymous namespace)::GemmAutotuner::GetBestAlgorithm<long, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&>(xla::HloInstruction const*, absl::lts_20230802::Span<long const>, double, bool, xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&)::'lambda'(long const&)&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:321:15 #6 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::TuneGpuBlas(xla::HloInstruction const*, xla::gpu::GemmConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:256:12 #7 0x7f06b31bb5b7 in xla::gpu::(anonymous namespace)::GemmAutotuner::operator()(xla::HloInstruction const*, xla::gpu::AutotuneCacheKey const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:137:18 #8 0x7f06b31b6760 in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0::operator()() const /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #9 0x7f06b31b6760 in absl::lts_20230802::StatusOr<xla::AutotuneResult> std::__invoke_impl<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(std::__invoke_other, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, 
xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 #10 0x7f06b31b6760 in std::enable_if<is_invocable_r_v<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>, absl::lts_20230802::StatusOr<xla::AutotuneResult>>::type std::__invoke_r<absl::lts_20230802::StatusOr<xla::AutotuneResult>, xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&>(xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:114:9 #11 0x7f06b31b6760 in std::_Function_handler<absl::lts_20230802::StatusOr<xla::AutotuneResult> (), xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&)::$_0>::_M_invoke(std::_Any_data const&) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:290:9 #12 0x7f06b308670d in std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()>::operator()() const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:590:9 #13 0x7f06b308670d in xla::gpu::AutotunerUtil::Autotune(xla::HloInstruction const*, xla::gpu::AutotuneConfig const&, std::function<absl::lts_20230802::StatusOr<xla::AutotuneResult> ()> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/autotuner_util.cc:460:3 #14 0x7f06b31b336e in xla::gpu::(anonymous namespace)::RunOnInstruction(xla::HloInstruction*, xla::gpu::(anonymous namespace)::GemmAutotuner&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:418:3 #15 0x7f06b31b336e in 
xla::gpu::(anonymous namespace)::RunOnComputation(xla::HloComputation*, xla::gpu::(anonymous namespace)::GemmAutotuner&, unsigned long*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:468:7 #16 0x7f06b31b336e in xla::gpu::GemmAlgorithmPicker::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/autotuning/gemm_algorithm_picker.cc:495:5 #17 0x7f06b30242f3 in xla::HloPassPipeline::RunHelper(xla::HloPassInterface*, xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_pipeline.h:150:5 #18 0x7f06b3010bb9 in absl::lts_20230802::StatusOr<bool> xla::HloPassPipeline::RunPassesInternal<xla::HloModule>(xla::HloModule*, xla::DebugOptions const&, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:198:30 #19 0x7f06b300f786 in xla::HloPassPipeline::Run(xla::HloModule*, absl::lts_20230802::flat_hash_set<std::basic_string_view<char, std::char_traits<char>>, absl::lts_20230802::container_internal::StringHash, 
absl::lts_20230802::container_internal::StringEq, std::allocator<std::basic_string_view<char, std::char_traits<char>>>> const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/hlo/pass/hlo_pass_pipeline.cc:338:10 #20 0x5647cfd66945 in xla::HloPassInterface::Run(xla::HloModule*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/hlo/pass/hlo_pass_interface.h:85:12 #21 0x7f06c2908be0 in xla::gpu::GpuCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1754:3 #22 0x7f06c2a000f3 in xla::gpu::AMDGPUCompiler::OptimizeHloPostLayoutAssignment(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&, tsl::thread::ThreadPool*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/amdgpu_compiler.cc:197:3 #23 0x7f06c28f85e9 in xla::gpu::GpuCompiler::OptimizeHloModule(xla::HloModule*, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&, xla::Compiler::TargetConfig const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1392:3 #24 0x7f06c291250d in xla::gpu::GpuCompiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, xla::Compiler::CompileOptions const&) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler.cc:1824:3 #25 0x5647cfd63784 in xla::Compiler::RunHloPasses(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, stream_executor::StreamExecutor*, stream_executor::DeviceMemoryAllocator*) 
/root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/./xla/service/compiler.h:177:12 #26 0x7f06c339acba in xla::HloTestBase::GetOptimizedModule(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/hlo_test_base.cc:188:32 #27 0x5647cfd89516 in xla::gpu::(anonymous namespace)::GpuCompilerTest_CollectivePermuteDecompositionAndPipelining_Test::TestBody() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/service/gpu/gpu_compiler_test.cc:879:3 #28 0x7f06c2c649dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #29 0x7f06c2c649dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #30 0x7f06c2c64708 in testing::Test::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #31 0x7f06c2c6771b in testing::TestInfo::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #32 0x7f06c2c6a5ab in testing::TestSuite::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #33 0x7f06c2c96eba in testing::internal::UnitTestImpl::RunAllTests() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #34 0x7f06c2c9579d in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #35 0x7f06c2c9579d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #36 0x7f06c2c95203 in testing::UnitTest::Run() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #37 0x7f06c2d679b8 in RUN_ALL_TESTS() /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #38 0x7f06c2d679b8 in main /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #39 0x7f064c0b3d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: heap-buffer-overflow /root/.cache/bazel/_bazel_root/f367074f9120c6f1a67d35844ac058a3/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:990:36 in absl::lts_20230802::container_internal::CommonFields::capacity() const Shadow bytes around the buggy address: 0x5030001d9500: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9580: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x5030001d9600: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x5030001d9680: fd fd fd fd fa fa fd fd fd fa fa fa fd fd fd fa 0x5030001d9700: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd =>0x5030001d9780: fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa[fa] 
  0x5030001d9800: 00 00 00 00 fa fa 00 00 00 00 fa fa fd fd fd fd
  0x5030001d9880: fa fa fd fd fd fd fa fa fd fd fd fa fa fa fd fd
  0x5030001d9900: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
  0x5030001d9980: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
  0x5030001d9a00: fa fa fd fd fd fa fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==1718600==ABORTING
```
Copybara import of the project:

--
9a75d26 by alekstheod <[email protected]>:

Fix access memory asan issue in redzone_allocator_kernel_rocm.cu

Merging this change closes #24898

COPYBARA_INTEGRATE_REVIEW=#24898 from ROCm:ci_fix_asan_invalid_memory_access_in_redzone_allocator_kernel_rocm_cu 9a75d26
PiperOrigin-RevId: 745563669
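For readers checking the report above by hand, the following is a minimal sketch (not part of the original commit) of how the faulting address maps onto the shadow-byte row that ASan marks with `=>`. The addresses, the row contents, and the 8-bytes-per-shadow-byte granularity are copied from the report; the helper name `shadow_index` and the constant names are hypothetical and exist only for this illustration.

```python
# Values copied from the AddressSanitizer report above.
FAULT_ADDR = 0x5030001D97F8    # address named in "is located 8 bytes before ..."
REGION_START = 0x5030001D9800  # start of the 32-byte heap region
ROW_BASE = 0x5030001D9780      # base address of the row marked with "=>"
ROW = "fd fa fa fa 00 00 00 fa fa fa 00 00 00 00 fa fa".split()

# From the legend: one shadow byte represents 8 application bytes.
SHADOW_GRANULARITY = 8

# Subset of the shadow byte legend that appears in this row.
LEGEND = {
    "00": "addressable",
    "fa": "heap left redzone",
    "fd": "freed heap region",
}


def shadow_index(addr, row_base, granularity=SHADOW_GRANULARITY):
    """Return the position within a shadow row that covers `addr`.

    Hypothetical helper: it only does the offset arithmetic implied by the
    legend and does not reproduce ASan's real shadow-memory mapping.
    """
    offset = addr - row_base
    if offset < 0:
        raise ValueError("address lies before the start of this row")
    return offset // granularity


idx = shadow_index(FAULT_ADDR, ROW_BASE)
kind = LEGEND[ROW[idx]]
distance = REGION_START - FAULT_ADDR

print(f"shadow byte #{idx} in the row is {ROW[idx]} ({kind})")
print(f"fault is {distance} bytes before the region start")
```

The computed position is the last byte of the marked row, which is the bracketed `[fa]` entry in the report, so the read landed in the left redzone just ahead of the allocation.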
copybara-service bot pushed a commit that referenced this pull request on Apr 9, 2025
Imported from GitHub PR #24900 Fix asan memory access violation: ``` exec ${PAGER:-/usr/bin/less} "$0" || exit 1 Executing tests from //xla/service:elemental_ir_emitter_test_gpu_amd_any ----------------------------------------------------------------------------- Running test /home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any.runfiles/xla/xla/service/elemental_ir_emitter_test_gpu_amd_any --gtest_shuffle --gtest_fail_if_no_test_linked on GPU 3 Note: Randomizing tests' orders with a seed of 19906 . [==========] Running 118 tests from 13 test suites. [----------] Global test environment set-up. [----------] 10 tests from ElementalIrEmitterExecutionTypedTest/7, where TypeParam = ml_dtypes::float8_internal::float8_e5m2 [ RUN ] ElementalIrEmitterExecutionTypedTest/7.ConvertFloatsToFloat ================================================================= ==2457579==ERROR: AddressSanitizer: use-after-poison on address 0x506000843a08 at pc 0x7f401151be6a bp 0x7ffd1e3c3410 sp 0x7ffd1e3c3408 READ of size 8 at 0x506000843a08 thread T0 #0 0x7f401151be69 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23 #1 0x7f401151b036 in stream_executor::gpu::RocmExecutor::UnloadModule(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:496:10 #2 0x7f405dee713b in stream_executor::ScopedModuleHandle::~ScopedModuleHandle() /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/stream_executor/scoped_module_handle.h:48:7 #3 0x7f405dee713b in std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::~pair() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_iterator.h:2488:12 #4 0x7f405dee713b in void __gnu_cxx::new_allocator<std::pair<stream_executor::StreamExecutor* const, 
stream_executor::ScopedModuleHandle>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:168:10 #5 0x7f405dee713b in void std::allocator_traits<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>&, std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:535:8 #6 0x7f405dee713b in void absl::lts_20230802::container_internal::map_slot_policy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/container_memory.h:419:7 #7 0x7f405dee713b in void absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/flat_hash_map.h:578:5 #8 0x7f405dee713b in void absl::lts_20230802::container_internal::common_policy_traits<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, void>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/common_policy_traits.h:50:5 #9 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1946:9 #10 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::~raw_hash_set() 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1885:5 #11 0x7f405dee8580 in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:155:1 #12 0x7f405dee8d4d in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:151:33 #13 0x7f407b818b3f in std::default_delete<xla::Executable>::operator()(xla::Executable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2 #14 0x7f407b818b3f in std::unique_ptr<xla::Executable, std::default_delete<xla::Executable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4 #15 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7 #16 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7 #17 0x7f407b7e6503 in std::default_delete<xla::OpaqueExecutable>::operator()(xla::OpaqueExecutable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2 #18 0x7f407b7e6503 in std::unique_ptr<xla::OpaqueExecutable, std::default_delete<xla::OpaqueExecutable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4 #19 0x7f407b7e6503 in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:400:1 #20 0x7f407b7e57c3 in 
xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3 #21 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12 #22 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5 #23 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9 #24 0x55b8a3cbf235 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12 #25 0x55b8a3cceda8 in xla::(anonymous 
namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5 #26 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36 #27 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #28 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #29 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #30 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #31 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #32 0x7f407b322eba in testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #33 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #34 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #35 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #36 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #37 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #38 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #39 0x7f4004766e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #40 0x55b8a3b9be44 in _start (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x10ce44) (BuildId: 1c37d17e488373aad7bf33204cb4234e) 0x506000843a08 is located 40 bytes inside of 56-byte region [0x5060008439e0,0x506000843a18) allocated by thread T0 here: #0 0x55b8a3c3607f in malloc (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x1a707f) (BuildId: 1c37d17e488373aad7bf33204cb4234e) #1 0x7f4004a4a98b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2) #2 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, 
std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::initialize_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2505:5 #3 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::resize(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2515:5 #4 0x7f40115443fa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::prepare_insert(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2672:7 #5 0x7f40115442df in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::find_or_prepare_insert<stream_executor::ModuleHandle>(stream_executor::ModuleHandle const&) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2659:13 #6 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace_impl<stream_executor::ModuleHandle const&>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:202:22 #7 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, 
std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace<stream_executor::ModuleHandle, 0>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:139:12 #8 0x7f4011524701 in decltype(absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>::value(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>* std::addressof<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&)(decltype(__declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(0)) std::declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&>()()))) absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::operator[]<stream_executor::ModuleHandle, absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:184:28 #9 0x7f4011524701 in stream_executor::gpu::RocmExecutor::LoadModuleFromHsaco(char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:717:39 #10 0x7f4011524387 in 
stream_executor::gpu::RocmExecutor::LoadModule(stream_executor::MultiModuleLoaderSpec const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:705:12 #11 0x7f405deeae34 in xla::gpu::GpuExecutable::ResolveConstantGlobals(stream_executor::Stream*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:499:5 #12 0x7f405def050a in xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl(xla::ServiceExecutableRunOptions const*, std::variant<absl::lts_20230802::Span<xla::ShapedBuffer const* const>, absl::lts_20230802::Span<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:703:5 #13 0x7f405deefc6f in xla::gpu::GpuExecutable::ExecuteAsyncOnStream(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:661:10 #14 0x7f401607a78e in xla::Executable::ExecuteAsyncOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:229:7 #15 0x7f4016079fd3 in xla::Executable::ExecuteOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:153:7 #16 0x7f407b7ea78b in xla::HloRunner::ExecuteWithExecutionInputs(xla::Executable*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:448:3 #17 0x7f407b7ecde2 in xla::HloRunner::ExecuteWithMovedDeviceBuffers(xla::Executable*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, xla::ExecutionProfile*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:415:3 #18 0x7f407b7e642a in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:398:10 #19 0x7f407b7e57c3 in xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3 #20 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12 #21 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5 #22 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9 #23 0x55b8a3cbf235 in 
xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12 #24 0x55b8a3cceda8 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5 #25 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36 #26 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #27 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #28 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #29 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #30 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #31 0x7f407b322eba in 
testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #32 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #33 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #34 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #35 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #36 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #37 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: use-after-poison /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle) Shadow bytes around the buggy address: 0x506000843780: fa fa fa fa fd fd fd fd fd fd fd fa fa fa fa fa 0x506000843800: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd 0x506000843880: fd fd fd fa fa fa fa fa fd fd fd fd fd fd fd fd 0x506000843900: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa 0x506000843980: fd fd fd fd fd fd fd fa fa fa fa fa 00 00 00 00 =>0x506000843a00: f7[f7]f7 fa fa fa fa fa 00 00 00 00 00 00 00 
00
  0x506000843a80: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa
  0x506000843b00: 00 00 00 00 00 00 00 fa fa fa fa fa 00 00 00 00
  0x506000843b80: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 fa
  0x506000843c00: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa
  0x506000843c80: 00 00 00 00 00 00 00 fa fa fa fa fa fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable: 00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone: fa
  Freed heap region: fd
  Stack left redzone: f1
  Stack mid redzone: f2
  Stack right redzone: f3
  Stack after return: f5
  Stack use after scope: f8
  Global redzone: f9
  Global init order: f6
  Poisoned by user: f7
  Container overflow: fc
  Array cookie: ac
  Intra object redzone: bb
  ASan internal: fe
  Left alloca redzone: ca
  Right alloca redzone: cb
==2457579==ABORTING
```
Copybara import of the project:
--
8f74d4c by alekstheod <[email protected]>:
Fix asan report memory access vialation in rocm_executor
Merging this change closes #24900
FUTURE_COPYBARA_INTEGRATE_REVIEW=#24900 from ROCm:ci_fix_invalid_memory_access_in_rocm_executor 8f74d4c
PiperOrigin-RevId: 745548395
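The report above points at an 8-byte read in `RocmExecutor::UnloadGpuBinary` that lands inside slot storage of the `flat_hash_map<ModuleHandle, std::pair<ihipModule_t*, unsigned long>>`, in a region marked "Poisoned by user". The trace does not show the fix itself, so the following is only a minimal sketch of one way to avoid this class of report, assuming the read went through a reference or iterator to a map entry that had already been erased. It uses `std::unordered_map` and hypothetical names (`ModuleTable`, `Load`, `Unload`, `FakeModule`), not the real `RocmExecutor` members.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for illustration; not the real stream_executor types.
using Handle = const void*;
using FakeModule = int;

class ModuleTable {
 public:
  // Registers a handle on first use and bumps its reference count.
  void Load(Handle handle, FakeModule module) {
    // emplace leaves an existing entry untouched and returns its iterator.
    auto it =
        modules_.emplace(handle, std::make_pair(module, std::uint64_t{0}))
            .first;
    ++it->second.second;
  }

  // Drops one reference. Returns true and writes the module to *unloaded
  // only when the last reference is released and the entry is erased.
  bool Unload(Handle handle, FakeModule* unloaded) {
    assert(unloaded != nullptr);
    auto it = modules_.find(handle);
    if (it == modules_.end()) return false;
    if (--it->second.second > 0) return false;
    // Copy the value out BEFORE erasing: after erase(), `it` and any
    // reference into the entry must not be read again.
    const FakeModule module = it->second.first;
    modules_.erase(it);
    *unloaded = module;
    return true;
  }

  bool Contains(Handle handle) const { return modules_.count(handle) != 0; }

 private:
  // handle -> (module, reference count)
  std::unordered_map<Handle, std::pair<FakeModule, std::uint64_t>> modules_;
};
```

The point of the sketch is the ordering inside `Unload`: everything that is needed from the entry is copied into a local before `erase`, so no later statement touches storage owned by the map.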
copybara-service bot pushed a commit that referenced this pull request Apr 9, 2025
Imported from GitHub PR #24900 Fix asan memory access violation: ``` exec ${PAGER:-/usr/bin/less} "$0" || exit 1 Executing tests from //xla/service:elemental_ir_emitter_test_gpu_amd_any ----------------------------------------------------------------------------- Running test /home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any.runfiles/xla/xla/service/elemental_ir_emitter_test_gpu_amd_any --gtest_shuffle --gtest_fail_if_no_test_linked on GPU 3 Note: Randomizing tests' orders with a seed of 19906 . [==========] Running 118 tests from 13 test suites. [----------] Global test environment set-up. [----------] 10 tests from ElementalIrEmitterExecutionTypedTest/7, where TypeParam = ml_dtypes::float8_internal::float8_e5m2 [ RUN ] ElementalIrEmitterExecutionTypedTest/7.ConvertFloatsToFloat ================================================================= ==2457579==ERROR: AddressSanitizer: use-after-poison on address 0x506000843a08 at pc 0x7f401151be6a bp 0x7ffd1e3c3410 sp 0x7ffd1e3c3408 READ of size 8 at 0x506000843a08 thread T0 #0 0x7f401151be69 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23 #1 0x7f401151b036 in stream_executor::gpu::RocmExecutor::UnloadModule(stream_executor::ModuleHandle) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:496:10 #2 0x7f405dee713b in stream_executor::ScopedModuleHandle::~ScopedModuleHandle() /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/stream_executor/scoped_module_handle.h:48:7 #3 0x7f405dee713b in std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::~pair() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_iterator.h:2488:12 #4 0x7f405dee713b in void __gnu_cxx::new_allocator<std::pair<stream_executor::StreamExecutor* const, 
stream_executor::ScopedModuleHandle>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:168:10 #5 0x7f405dee713b in void std::allocator_traits<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy<std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>&, std::pair<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:535:8 #6 0x7f405dee713b in void absl::lts_20230802::container_internal::map_slot_policy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/container_memory.h:419:7 #7 0x7f405dee713b in void absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/flat_hash_map.h:578:5 #8 0x7f405dee713b in void absl::lts_20230802::container_internal::common_policy_traits<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, void>::destroy<std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>(std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>*, absl::lts_20230802::container_internal::map_slot_type<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/common_policy_traits.h:50:5 #9 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::destroy_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1946:9 #10 0x7f405dee713b in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::StreamExecutor*, stream_executor::ScopedModuleHandle>, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Hash, absl::lts_20230802::container_internal::HashEq<stream_executor::StreamExecutor*, void>::Eq, std::allocator<std::pair<stream_executor::StreamExecutor* const, stream_executor::ScopedModuleHandle>>>::~raw_hash_set() 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:1885:5 #11 0x7f405dee8580 in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:155:1 #12 0x7f405dee8d4d in xla::gpu::GpuExecutable::~GpuExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:151:33 #13 0x7f407b818b3f in std::default_delete<xla::Executable>::operator()(xla::Executable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2 #14 0x7f407b818b3f in std::unique_ptr<xla::Executable, std::default_delete<xla::Executable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4 #15 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7 #16 0x7f407b818b3f in xla::(anonymous namespace)::HloRunnerExecutable::~HloRunnerExecutable() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:69:7 #17 0x7f407b7e6503 in std::default_delete<xla::OpaqueExecutable>::operator()(xla::OpaqueExecutable*) const /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85:2 #18 0x7f407b7e6503 in std::unique_ptr<xla::OpaqueExecutable, std::default_delete<xla::OpaqueExecutable>>::~unique_ptr() /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:361:4 #19 0x7f407b7e6503 in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:400:1 #20 0x7f407b7e57c3 in 
xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3 #21 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12 #22 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5 #23 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9 #24 0x55b8a3cbf235 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12 #25 0x55b8a3cceda8 in xla::(anonymous 
namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5 #26 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36 #27 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #28 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #29 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #30 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #31 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #32 0x7f407b322eba in testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #33 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #34 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #35 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #36 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #37 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #38 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #39 0x7f4004766e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #40 0x55b8a3b9be44 in _start (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x10ce44) (BuildId: 1c37d17e488373aad7bf33204cb4234e) 0x506000843a08 is located 40 bytes inside of 56-byte region [0x5060008439e0,0x506000843a18) allocated by thread T0 here: #0 0x55b8a3c3607f in malloc (/home/atheodor/projects/tmp/xla_asan/execroot/xla/bazel-out/k8-opt/bin/xla/service/elemental_ir_emitter_test_gpu_amd_any+0x1a707f) (BuildId: 1c37d17e488373aad7bf33204cb4234e) #1 0x7f4004a4a98b in operator new(unsigned long) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xae98b) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2) #2 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, 
std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::initialize_slots() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2505:5 #3 0x7f40115449aa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::resize(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2515:5 #4 0x7f40115443fa in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::prepare_insert(unsigned long) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2672:7 #5 0x7f40115442df in std::pair<unsigned long, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::find_or_prepare_insert<stream_executor::ModuleHandle>(stream_executor::ModuleHandle const&) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_set.h:2659:13 #6 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace_impl<stream_executor::ModuleHandle const&>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:202:22 #7 0x7f4011524701 in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::iterator, bool> absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, 
std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::try_emplace<stream_executor::ModuleHandle, 0>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:139:12 #8 0x7f4011524701 in decltype(absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>::value(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>* std::addressof<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&)(decltype(__declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>(0)) std::declval<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>&>()()))) absl::lts_20230802::container_internal::raw_hash_map<absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>, absl::lts_20230802::hash_internal::Hash<stream_executor::ModuleHandle>, std::equal_to<stream_executor::ModuleHandle>, std::allocator<std::pair<stream_executor::ModuleHandle const, std::pair<ihipModule_t*, unsigned long>>>>::operator[]<stream_executor::ModuleHandle, absl::lts_20230802::container_internal::FlatHashMapPolicy<stream_executor::ModuleHandle, std::pair<ihipModule_t*, unsigned long>>>(stream_executor::ModuleHandle const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_absl/absl/container/internal/raw_hash_map.h:184:28 #9 0x7f4011524701 in stream_executor::gpu::RocmExecutor::LoadModuleFromHsaco(char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:717:39 #10 0x7f4011524387 in 
stream_executor::gpu::RocmExecutor::LoadModule(stream_executor::MultiModuleLoaderSpec const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:705:12 #11 0x7f405deeae34 in xla::gpu::GpuExecutable::ResolveConstantGlobals(stream_executor::Stream*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:499:5 #12 0x7f405def050a in xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl(xla::ServiceExecutableRunOptions const*, std::variant<absl::lts_20230802::Span<xla::ShapedBuffer const* const>, absl::lts_20230802::Span<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:703:5 #13 0x7f405deefc6f in xla::gpu::GpuExecutable::ExecuteAsyncOnStream(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/gpu/gpu_executable.cc:661:10 #14 0x7f401607a78e in xla::Executable::ExecuteAsyncOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:229:7 #15 0x7f4016079fd3 in xla::Executable::ExecuteOnStreamWrapper(xla::ServiceExecutableRunOptions const*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/executable.cc:153:7 #16 0x7f407b7ea78b in xla::HloRunner::ExecuteWithExecutionInputs(xla::Executable*, std::vector<xla::ExecutionInput, std::allocator<xla::ExecutionInput>>, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:448:3 #17 0x7f407b7ecde2 in xla::HloRunner::ExecuteWithMovedDeviceBuffers(xla::Executable*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, xla::ExecutionProfile*) 
/home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:415:3 #18 0x7f407b7e642a in xla::HloRunner::ExecuteWithMovedDeviceBuffersAndBufferAssignment(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, xla::BufferAssignmentProto const*, std::vector<xla::ScopedShapedBuffer, std::allocator<xla::ScopedShapedBuffer>>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:398:10 #19 0x7f407b7e57c3 in xla::HloRunner::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool, xla::ExecutionProfile*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/hlo_runner.cc:221:3 #20 0x55b8a3cb4622 in xla::HloRunnerInterface::Execute(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal const* const>, bool) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/service/hlo_runner_interface.h:244:12 #21 0x55b8a3cb4622 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompareInternal(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, bool, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:238:5 #22 0x55b8a3cbf766 in xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, absl::lts_20230802::Span<xla::Literal* const>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:94:9 #23 0x55b8a3cbf235 in 
xla::HloRunnerAgnosticReferenceMixin<xla::HloRunnerAgnosticTestBase>::RunAndCompare(std::unique_ptr<xla::HloModule, std::default_delete<xla::HloModule>>, std::optional<xla::ErrorSpec> const&, std::function<void (xla::HloModule*)> const&, std::function<void (xla::HloModule*)> const&, std::optional<long>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/./xla/tests/hlo_runner_agnostic_reference_mixin.h:140:12 #24 0x55b8a3cceda8 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTest::RunTypeConversionTest(std::basic_string_view<char, std::char_traits<char>>) /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:76:5 #25 0x55b8a3cd8cf3 in xla::(anonymous namespace)::ElementalIrEmitterExecutionTypedTest_ConvertFloatsToFloat_Test<ml_dtypes::float8_internal::float8_e5m2>::TestBody() /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/service/elemental_ir_emitter_test.cc:472:36 #26 0x7f407b2f09dd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #27 0x7f407b2f09dd in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #28 0x7f407b2f0708 in testing::Test::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2739:5 #29 0x7f407b2f371b in testing::TestInfo::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2885:11 #30 0x7f407b2f65ab in testing::TestSuite::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:3063:30 #31 0x7f407b322eba in 
testing::internal::UnitTestImpl::RunAllTests() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:6054:44 #32 0x7f407b32179d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2664:10 #33 0x7f407b32179d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:2700:14 #34 0x7f407b321203 in testing::UnitTest::Run() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/src/gtest.cc:5594:10 #35 0x7f407b3f59b8 in RUN_ALL_TESTS() /home/atheodor/projects/tmp/xla_asan/execroot/xla/external/com_google_googletest/googletest/include/gtest/gtest.h:2334:73 #36 0x7f407b3f59b8 in main /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/tests/xla_internal_test_main.cc:65:10 #37 0x7f4004766d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: use-after-poison /home/atheodor/projects/tmp/xla_asan/execroot/xla/xla/stream_executor/rocm/rocm_executor.cc:596:23 in stream_executor::gpu::RocmExecutor::UnloadGpuBinary(stream_executor::ModuleHandle) Shadow bytes around the buggy address: 0x506000843780: fa fa fa fa fd fd fd fd fd fd fd fa fa fa fa fa 0x506000843800: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd 0x506000843880: fd fd fd fa fa fa fa fa fd fd fd fd fd fd fd fd 0x506000843900: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa 0x506000843980: fd fd fd fd fd fd fd fa fa fa fa fa 00 00 00 00 =>0x506000843a00: f7[f7]f7 fa fa fa fa fa 00 00 00 00 00 00 00 
00 0x506000843a80: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa 0x506000843b00: 00 00 00 00 00 00 00 fa fa fa fa fa 00 00 00 00 0x506000843b80: 00 00 00 fa fa fa fa fa 00 00 00 00 00 00 00 fa 0x506000843c00: fa fa fa fa 00 00 00 00 00 00 00 fa fa fa fa fa 0x506000843c80: 00 00 00 00 00 00 00 fa fa fa fa fa fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==2457579==ABORTING ``` Copybara import of the project: -- 8f74d4c by alekstheod <[email protected]>: Fix asan report memory access violation in rocm_executor Merging this change closes #24900 COPYBARA_INTEGRATE_REVIEW=#24900 from ROCm:ci_fix_invalid_memory_access_in_rocm_executor 8f74d4c PiperOrigin-RevId: 745592235
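For readers unfamiliar with this class of report: the faulting bytes are marked `f7` ("Poisoned by user") inside a slot array that the trace shows was allocated by the `flat_hash_map` in `RocmExecutor::LoadModuleFromHsaco`, and the access happens in `RocmExecutor::UnloadGpuBinary`. One plausible reading is that a map entry was touched after it had been erased. The sketch below illustrates the usual remedy, which is to copy or move what you need out of the entry before erasing it. It is not the code from the commit: `ModuleTable`, `ModuleEntry`, and the use of `std::unordered_map` and `std::string` are hypothetical stand-ins chosen so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins; these are not the real stream_executor types.
using ModuleHandle = const void*;

struct ModuleEntry {
  std::string module;      // stands in for the loaded GPU module handle
  std::uint64_t refcount;  // number of outstanding Load() calls
};

class ModuleTable {
 public:
  // Registers one more user of `handle`, creating the entry on first use.
  void Load(ModuleHandle handle, std::string module) {
    auto [it, inserted] =
        table_.try_emplace(handle, ModuleEntry{std::move(module), 0});
    ++it->second.refcount;
  }

  // Drops one reference. Returns the module that should now be unloaded,
  // or an empty string if the handle is unknown or still referenced.
  std::string Unload(ModuleHandle handle) {
    auto it = table_.find(handle);
    if (it == table_.end()) return {};
    assert(it->second.refcount > 0);
    if (--it->second.refcount > 0) return {};
    // Move the payload out BEFORE erasing: once erase() returns, `it` and
    // any reference into the entry are invalid and must not be read.
    std::string module = std::move(it->second.module);
    table_.erase(it);
    return module;
  }

  std::size_t size() const { return table_.size(); }

 private:
  std::unordered_map<ModuleHandle, ModuleEntry> table_;
};
```

The same ordering rule applies to `absl::flat_hash_map`, where a rehash caused by an insertion also invalidates outstanding references.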
copybara-service bot
pushed a commit
that referenced
this pull request
Jul 28, 2025
Fixes jax-ml/jax#30517 Example traceback of crash: ``` * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10) * frame #0: 0x0000000150679fe4 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 164 frame #1: 0x000000015067dc20 libjax_common.dylib`std::__1::__shared_ptr_emplace<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>, std::__1::allocator<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>>>::__on_zero_shared() + 32 frame #2: 0x000000015067a0e4 libjax_common.dylib`std::__1::unique_ptr<jax::(anonymous namespace)::PjitFunctionCache::Value, std::__1::default_delete<jax::(anonymous namespace)::PjitFunctionCache::Value>>::reset[abi:ne180100](jax::(anonymous namespace)::PjitFunctionCache::Value*) + 104 frame #3: 0x000000015067e230 libjax_common.dylib`_object* nanobind::detail::func_create<true, true, jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1, void, nanobind::handle, 0ul>(jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1&&, void (*)(nanobind::handle), std::__1::integer_sequence<unsigned long, 0ul>)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 336 frame #4: 0x000000015394ba58 
libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156 frame #5: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116 frame #6: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340 frame #7: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352 frame #8: 0x00000001506714c8 libjax_common.dylib`PjitFunction_tp_dealloc + 504 frame #9: 0x0000000106420dd8 libpython3.12.dylib`PyDict_DelItem + 668 frame #10: 0x00000001063cbc38 libpython3.12.dylib`_PyEval_EvalFrameDefault + 26328 frame #11: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116 frame #12: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340 frame #13: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352 frame #14: 0x0000000150676a4c libjax_common.dylib`jax::ArgumentSignature::~ArgumentSignature() + 172 frame #15: 0x0000000150679c88 libjax_common.dylib`jax::CallSignature::~CallSignature() + 456 frame #16: 0x0000000150679fb0 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 112 frame #17: 0x0000000150672400 libjax_common.dylib`jax::(anonymous namespace)::PjitFunctionCache::Clear() + 44 frame #18: 0x000000015067a340 libjax_common.dylib`_object* nanobind::detail::func_create<false, true, void nanobind::cpp_function_def<jax::(anonymous namespace)::PjitFunctionCache, void, jax::(anonymous namespace)::PjitFunctionCache, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(void (jax::(anonymous namespace)::PjitFunctionCache::*)(), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(jax::(anonymous namespace)::PjitFunctionCache*), void, jax::(anonymous namespace)::PjitFunctionCache*, 0ul, nanobind::scope, nanobind::name, 
nanobind::is_method, nanobind::lock_self>(jax::(anonymous namespace)::PjitFunctionCache&&, void (*)(nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self), std::__1::integer_sequence<unsigned long, 0ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 80 frame #19: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156 frame #20: 0x00000001063ed3a8 libpython3.12.dylib`_PyEval_EvalFrameDefault + 163400 frame #21: 0x00000001064e3450 libpython3.12.dylib`atexit_callfuncs.llvm.13196908868581062239 + 96 frame #22: 0x00000001064ece28 libpython3.12.dylib`Py_FinalizeEx + 96 frame #23: 0x000000010650366c libpython3.12.dylib`Py_Exit + 20 frame #24: 0x000000010650364c libpython3.12.dylib`handle_system_exit + 32 frame #25: 0x0000000106503330 libpython3.12.dylib`_PyErr_PrintEx.llvm.12194046240795210664 + 52 frame #26: 0x000000010650de00 libpython3.12.dylib`_PyRun_SimpleFileObject + 464 frame #27: 0x00000001065051e4 libpython3.12.dylib`_PyRun_AnyFileObject + 80 frame #28: 0x00000001065045a0 libpython3.12.dylib`pymain_run_file_obj + 164 frame #29: 0x0000000106503c00 libpython3.12.dylib`pymain_run_file + 72 frame #30: 0x0000000106501e04 libpython3.12.dylib`Py_RunMain + 1120 frame #31: 0x0000000106501808 libpython3.12.dylib`pymain_main + 456 frame #32: 0x0000000106501634 libpython3.12.dylib`Py_BytesMain + 36 frame #33: 0x00000001951fab98 dyld`start + 6076 ``` PiperOrigin-RevId: 787935136
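The traceback shows `LRUCache::Clear()` at frame #16 and again at frame #0: destroying a `CallSignature` runs Python deallocation, a weak-reference callback fires, and that callback ends up calling `Clear()` a second time while the first call is still in progress. A common way to make a clear operation tolerate this kind of re-entry is to detach the contents into a local container first, so the cache is already empty and consistent before any destructor runs. The sketch below shows that pattern only; whether the referenced commit uses it is not stated here, and `ReentrantSafeCache` and `Entry` are hypothetical names, not `xla::LRUCache`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <utility>

// Hypothetical cache used only to illustrate the pattern.
class ReentrantSafeCache {
 public:
  // An entry whose destructor runs an arbitrary callback, standing in for
  // a key whose destructor can call back into the interpreter.
  struct Entry {
    explicit Entry(std::function<void()> f) : on_destroy(std::move(f)) {}
    Entry(const Entry&) = delete;
    Entry& operator=(const Entry&) = delete;
    ~Entry() {
      if (on_destroy) on_destroy();
    }
    std::function<void()> on_destroy;
  };

  void Add(std::function<void()> on_destroy) {
    entries_.emplace_back(std::move(on_destroy));
  }

  void Clear() {
    // Detach everything first. After the swap, entries_ is empty and valid,
    // so a nested Clear() triggered by a destructor finds nothing to do.
    std::list<Entry> doomed;
    doomed.swap(entries_);
    // `doomed` is destroyed here, outside the cache's own state.
  }

  std::size_t size() const { return entries_.size(); }

 private:
  std::list<Entry> entries_;
};
```

With this shape, an entry whose destructor calls `Clear()` on the same cache no longer walks a container that is being torn down underneath it.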
copybara-service bot
pushed a commit
that referenced
this pull request
Jul 28, 2025
Fixes jax-ml/jax#30517 Example traceback of crash: ``` * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10) * frame #0: 0x0000000150679fe4 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 164 frame #1: 0x000000015067dc20 libjax_common.dylib`std::__1::__shared_ptr_emplace<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>, std::__1::allocator<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>>>::__on_zero_shared() + 32 frame #2: 0x000000015067a0e4 libjax_common.dylib`std::__1::unique_ptr<jax::(anonymous namespace)::PjitFunctionCache::Value, std::__1::default_delete<jax::(anonymous namespace)::PjitFunctionCache::Value>>::reset[abi:ne180100](jax::(anonymous namespace)::PjitFunctionCache::Value*) + 104 frame #3: 0x000000015067e230 libjax_common.dylib`_object* nanobind::detail::func_create<true, true, jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1, void, nanobind::handle, 0ul>(jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1&&, void (*)(nanobind::handle), std::__1::integer_sequence<unsigned long, 0ul>)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 336 frame #4: 0x000000015394ba58 
libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156 frame #5: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116 frame #6: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340 frame #7: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352 frame #8: 0x00000001506714c8 libjax_common.dylib`PjitFunction_tp_dealloc + 504 frame #9: 0x0000000106420dd8 libpython3.12.dylib`PyDict_DelItem + 668 frame #10: 0x00000001063cbc38 libpython3.12.dylib`_PyEval_EvalFrameDefault + 26328 frame #11: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116 frame #12: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340 frame #13: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352 frame #14: 0x0000000150676a4c libjax_common.dylib`jax::ArgumentSignature::~ArgumentSignature() + 172 frame #15: 0x0000000150679c88 libjax_common.dylib`jax::CallSignature::~CallSignature() + 456 frame #16: 0x0000000150679fb0 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 112 frame #17: 0x0000000150672400 libjax_common.dylib`jax::(anonymous namespace)::PjitFunctionCache::Clear() + 44 frame #18: 0x000000015067a340 libjax_common.dylib`_object* nanobind::detail::func_create<false, true, void nanobind::cpp_function_def<jax::(anonymous namespace)::PjitFunctionCache, void, jax::(anonymous namespace)::PjitFunctionCache, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(void (jax::(anonymous namespace)::PjitFunctionCache::*)(), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(jax::(anonymous namespace)::PjitFunctionCache*), void, jax::(anonymous namespace)::PjitFunctionCache*, 0ul, nanobind::scope, nanobind::name, 
nanobind::is_method, nanobind::lock_self>(jax::(anonymous namespace)::PjitFunctionCache&&, void (*)(nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self), std::__1::integer_sequence<unsigned long, 0ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 80 frame #19: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156 frame #20: 0x00000001063ed3a8 libpython3.12.dylib`_PyEval_EvalFrameDefault + 163400 frame #21: 0x00000001064e3450 libpython3.12.dylib`atexit_callfuncs.llvm.13196908868581062239 + 96 frame #22: 0x00000001064ece28 libpython3.12.dylib`Py_FinalizeEx + 96 frame #23: 0x000000010650366c libpython3.12.dylib`Py_Exit + 20 frame #24: 0x000000010650364c libpython3.12.dylib`handle_system_exit + 32 frame #25: 0x0000000106503330 libpython3.12.dylib`_PyErr_PrintEx.llvm.12194046240795210664 + 52 frame #26: 0x000000010650de00 libpython3.12.dylib`_PyRun_SimpleFileObject + 464 frame #27: 0x00000001065051e4 libpython3.12.dylib`_PyRun_AnyFileObject + 80 frame #28: 0x00000001065045a0 libpython3.12.dylib`pymain_run_file_obj + 164 frame #29: 0x0000000106503c00 libpython3.12.dylib`pymain_run_file + 72 frame #30: 0x0000000106501e04 libpython3.12.dylib`Py_RunMain + 1120 frame #31: 0x0000000106501808 libpython3.12.dylib`pymain_main + 456 frame #32: 0x0000000106501634 libpython3.12.dylib`Py_BytesMain + 36 frame #33: 0x00000001951fab98 dyld`start + 6076 ``` PiperOrigin-RevId: 787935136
copybara-service bot
pushed a commit
that referenced
this pull request
Jul 28, 2025
Fixes jax-ml/jax#30517

Example traceback of crash:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  * frame #0: 0x0000000150679fe4 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 164
    frame #1: 0x000000015067dc20 libjax_common.dylib`std::__1::__shared_ptr_emplace<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>, std::__1::allocator<xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>>>::__on_zero_shared() + 32
    frame #2: 0x000000015067a0e4 libjax_common.dylib`std::__1::unique_ptr<jax::(anonymous namespace)::PjitFunctionCache::Value, std::__1::default_delete<jax::(anonymous namespace)::PjitFunctionCache::Value>>::reset[abi:ne180100](jax::(anonymous namespace)::PjitFunctionCache::Value*) + 104
    frame #3: 0x000000015067e230 libjax_common.dylib`_object* nanobind::detail::func_create<true, true, jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1, void, nanobind::handle, 0ul>(jax::(anonymous namespace)::PjitFunctionCache::Lookup(xla::nb_class_ptr<jax::(anonymous namespace)::PjitFunctionCache>, nanobind::handle, nanobind::object)::$_1&&, void (*)(nanobind::handle), std::__1::integer_sequence<unsigned long, 0ul>)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 336
    frame #4: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156
    frame #5: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116
    frame #6: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340
    frame #7: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352
    frame #8: 0x00000001506714c8 libjax_common.dylib`PjitFunction_tp_dealloc + 504
    frame #9: 0x0000000106420dd8 libpython3.12.dylib`PyDict_DelItem + 668
    frame #10: 0x00000001063cbc38 libpython3.12.dylib`_PyEval_EvalFrameDefault + 26328
    frame #11: 0x0000000106356620 libpython3.12.dylib`PyObject_CallOneArg + 116
    frame #12: 0x0000000106421144 libpython3.12.dylib`PyObject_ClearWeakRefs + 340
    frame #13: 0x0000000106377f78 libpython3.12.dylib`func_dealloc + 352
    frame #14: 0x0000000150676a4c libjax_common.dylib`jax::ArgumentSignature::~ArgumentSignature() + 172
    frame #15: 0x0000000150679c88 libjax_common.dylib`jax::CallSignature::~CallSignature() + 456
    frame #16: 0x0000000150679fb0 libjax_common.dylib`xla::LRUCache<jax::CallSignature, std::__1::shared_ptr<jax::(anonymous namespace)::PjitCacheEntry>, absl::lts_20250127::hash_internal::Hash<jax::CallSignature>, std::__1::equal_to<jax::CallSignature>>::Clear() + 112
    frame #17: 0x0000000150672400 libjax_common.dylib`jax::(anonymous namespace)::PjitFunctionCache::Clear() + 44
    frame #18: 0x000000015067a340 libjax_common.dylib`_object* nanobind::detail::func_create<false, true, void nanobind::cpp_function_def<jax::(anonymous namespace)::PjitFunctionCache, void, jax::(anonymous namespace)::PjitFunctionCache, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(void (jax::(anonymous namespace)::PjitFunctionCache::*)(), nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(jax::(anonymous namespace)::PjitFunctionCache*), void, jax::(anonymous namespace)::PjitFunctionCache*, 0ul, nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self>(jax::(anonymous namespace)::PjitFunctionCache&&, void (*)(nanobind::scope, nanobind::name, nanobind::is_method, nanobind::lock_self), std::__1::integer_sequence<unsigned long, 0ul>, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&, nanobind::lock_self const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) + 80
    frame #19: 0x000000015394ba58 libjax_common.dylib`nanobind::detail::nb_func_vectorcall_simple_1(_object*, _object* const*, unsigned long, _object*) + 156
    frame #20: 0x00000001063ed3a8 libpython3.12.dylib`_PyEval_EvalFrameDefault + 163400
    frame #21: 0x00000001064e3450 libpython3.12.dylib`atexit_callfuncs.llvm.13196908868581062239 + 96
    frame #22: 0x00000001064ece28 libpython3.12.dylib`Py_FinalizeEx + 96
    frame #23: 0x000000010650366c libpython3.12.dylib`Py_Exit + 20
    frame #24: 0x000000010650364c libpython3.12.dylib`handle_system_exit + 32
    frame #25: 0x0000000106503330 libpython3.12.dylib`_PyErr_PrintEx.llvm.12194046240795210664 + 52
    frame #26: 0x000000010650de00 libpython3.12.dylib`_PyRun_SimpleFileObject + 464
    frame #27: 0x00000001065051e4 libpython3.12.dylib`_PyRun_AnyFileObject + 80
    frame #28: 0x00000001065045a0 libpython3.12.dylib`pymain_run_file_obj + 164
    frame #29: 0x0000000106503c00 libpython3.12.dylib`pymain_run_file + 72
    frame #30: 0x0000000106501e04 libpython3.12.dylib`Py_RunMain + 1120
    frame #31: 0x0000000106501808 libpython3.12.dylib`pymain_main + 456
    frame #32: 0x0000000106501634 libpython3.12.dylib`Py_BytesMain + 36
    frame #33: 0x00000001951fab98 dyld`start + 6076
```

PiperOrigin-RevId: 787935136
copybara-service bot
pushed a commit
that referenced
this pull request
Jul 29, 2025
Fixes jax-ml/jax#30517 PiperOrigin-RevId: 788532094