
Missing CPU features attributes on dispatch functions lead to UB / missed target instructions #16670

@bjacob

Description

Testcase: just an i8 x i8 -> i32 matmul:

func.func @matmul_dynamic(%lhs: tensor<?x?xi8>, %rhs: tensor<?x?xi8>, %acc: tensor<?x?xi32>) -> tensor<?x?xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xi8>, tensor<?x?xi8>) outs(%acc: tensor<?x?xi32>) -> tensor<?x?xi32>
  return %result: tensor<?x?xi32>
}

Reproduce:

tools/iree-compile \
  ~/matmul_i8.mlir -o /tmp/a.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-hal-dump-executable-intermediates-to=/tmp \
  -mlir-disable-threading \
  -mlir-print-ir-after-all \
  2>/tmp/log

Inspection of the generated assembly /tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.s shows that baseline AVX-512 code is generated (VPMADDWD) instead of the expected AVX-512-VNNI code (VPDPWSSD):

matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32:
[...]
	vshufi64x2	$27, %zmm16, %zmm16, %zmm19
	vpmaddwd	%zmm16, %zmm21, %zmm24
	vpmaddwd	%zmm17, %zmm21, %zmm26
	vpmaddwd	%zmm18, %zmm21, %zmm25
	vpmaddwd	%zmm19, %zmm21, %zmm21
[...]

Why? The dumped intermediates show that all the way through the post-linking optimized IR (/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll), the code was still calling the expected AVX-512-VNNI intrinsic:

define internal noundef i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32(ptr noalias nocapture nonnull readonly align 16 %0, ptr noalias nocapture nonnull readonly align 16 %1, ptr noalias nocapture nonnull readonly align 16 %2) #1 !dbg !90 {
[...]
  %358 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %334, <16 x i32> %354, <16 x i32> %347), !dbg !91
  %359 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %333, <16 x i32> %354, <16 x i32> %348), !dbg !91
  %360 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %332, <16 x i32> %354, <16 x i32> %349), !dbg !91
  %361 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %331, <16 x i32> %354, <16 x i32> %350), !dbg !91
  %362 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %330, <16 x i32> %355, <16 x i32> %347), !dbg !91
  %363 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %329, <16 x i32> %355, <16 x i32> %348), !dbg !91
[...]

But wait, what is that attribute #1 on that function? Does it have the required CPU feature enabled? Nope:

attributes #1 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot" "no-builtins" "nonlazybind" }
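
For contrast, a version of this function on which the vpdpwssd.512 call would be well-defined has to spell out the CPU/feature attributes, roughly like this (the exact feature string here is illustrative, not what IREE would actually emit):

attributes #1 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot" "no-builtins" "nonlazybind" "target-cpu"="znver4" "target-features"="+avx512f,+avx512bw,+avx512vnni" }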

So our code here is Undefined Behavior: the function calls an AVX-512-VNNI intrinsic without the corresponding target features in its attributes. Indeed, while initially minimizing this with llc, I ran into should-not-get-here crashes in x86 instruction selection. In our current e2e IREE use case, the Undefined Behavior doesn't crash or affect correctness, but it does cause us to miss the intended VNNI instruction.
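
For anyone who wants to poke at the instruction-selection side of this outside of IREE, the dumped IR can be fed straight to llc; a sketch (flags and output paths are illustrative):

# Codegen with only whatever features the function attributes provide (here: none);
# this is the configuration that exposed the isel crashes during minimization.
llc /tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll -o /tmp/repro.s

# Same IR, but with znver4 (and thus +avx512vnni) as the TargetMachine default, for comparison.
llc -mcpu=znver4 /tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll -o /tmp/repro_znver4.s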

"Of course" this dispatch function doesn't have the required +avx512vnni CPU feature attribute, since we never put it there. The only functions that have the +avx512vnni CPU feature attribute are the ukernel internal VNNI implementation functions, which are compiled with this CPU feature enabled in the first place.

I guess I was expecting the attribute to be propagated from callee to caller as the VNNI inner tile function gets inlined first into iree_uk_mmt4d and then into the dispatch function. It's not.

How do we resolve this in a way that doesn't violate the target-specialization design in LLVMCPUTarget? @benvanik
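
One possible shape of a fix, purely as a sketch and not a claim about where it should live (whether it is compatible with the multi-target-specialization design is exactly the open question): have the target stamp its configured CPU/feature strings onto every dispatch function it emits, so that inlined target intrinsics stay consistent with the function's declared subtarget. The helper below is hypothetical, not IREE's actual API:

// Hypothetical helper: stamp the target's configured CPU and feature strings
// onto every defined function in the module before the LLVM optimization
// pipeline runs, so inlined intrinsic calls (like llvm.x86.avx512.vpdpwssd.512)
// remain legal for the per-function subtarget.
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

static void setTargetAttributes(llvm::Module &module, llvm::StringRef targetCPU,
                                llvm::StringRef targetFeatures) {
  for (llvm::Function &func : module) {
    if (func.isDeclaration())
      continue;
    // Leave functions that already carry explicit attributes alone
    // (e.g. the ukernel-internal VNNI inner-tile implementations).
    if (!func.hasFnAttribute("target-cpu"))
      func.addFnAttr("target-cpu", targetCPU);
    if (!func.hasFnAttribute("target-features"))
      func.addFnAttr("target-features", targetFeatures);
  }
}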
