Testcase: just an i8 x i8 -> i32 matmul:
```mlir
func.func @matmul_dynamic(%lhs: tensor<?x?xi8>, %rhs: tensor<?x?xi8>, %acc: tensor<?x?xi32>) -> tensor<?x?xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xi8>, tensor<?x?xi8>) outs(%acc: tensor<?x?xi32>) -> tensor<?x?xi32>
  return %result: tensor<?x?xi32>
}
```
Reproduce:
```shell
tools/iree-compile \
  ~/matmul_i8.mlir -o /tmp/a.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-enable-ukernels=all \
  --iree-hal-dump-executable-intermediates-to=/tmp \
  -mlir-disable-threading \
  -mlir-print-ir-after-all \
  2>/tmp/log
```
Inspection of the generated assembly (`/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.s`) shows that baseline AVX-512 code is generated (VPMADDWD) instead of the expected AVX-512-VNNI code (VPDPWSSD):
```asm
matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32:
[...]
  vshufi64x2 $27, %zmm16, %zmm16, %zmm19
  vpmaddwd %zmm16, %zmm21, %zmm24
  vpmaddwd %zmm17, %zmm21, %zmm26
  vpmaddwd %zmm18, %zmm21, %zmm25
  vpmaddwd %zmm19, %zmm21, %zmm21
[...]
```
Why? The dumped intermediates show that all the way to the post-linking optimized IR (`/tmp/module_matmul_i8_linked_llvm_cpu_embedded_elf_x86_64.optimized.ll`), the code was still calling the expected AVX-512-VNNI intrinsic:
```llvm
define internal noundef i32 @matmul_dynamic_dispatch_3_mmt4d_DxDxDx16x16x2_i8xi8xi32(ptr noalias nocapture nonnull readonly align 16 %0, ptr noalias nocapture nonnull readonly align 16 %1, ptr noalias nocapture nonnull readonly align 16 %2) #1 !dbg !90 {
[...]
  %358 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %334, <16 x i32> %354, <16 x i32> %347), !dbg !91
  %359 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %333, <16 x i32> %354, <16 x i32> %348), !dbg !91
  %360 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %332, <16 x i32> %354, <16 x i32> %349), !dbg !91
  %361 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %331, <16 x i32> %354, <16 x i32> %350), !dbg !91
  %362 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %330, <16 x i32> %355, <16 x i32> %347), !dbg !91
  %363 = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %329, <16 x i32> %355, <16 x i32> %348), !dbg !91
[...]
```
But wait, what is that attribute group `#1` on the function? Does it enable the required CPU feature? Nope:

```llvm
attributes #1 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot" "no-builtins" "nonlazybind" }
```
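For comparison, a function on which the backend can legally select VNNI instructions would need the feature spelled out in its attributes, along these lines (abbreviated sketch; the real feature string for znver4 is much longer):

```llvm
attributes #1 = { nofree norecurse nosync nounwind "frame-pointer"="all" "hot" "no-builtins" "nonlazybind" "target-cpu"="znver4" "target-features"="+avx512f,+avx512vnni" }
```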
So our code here is Undefined Behavior, and indeed, while initially minimizing it with `llc`, I ran into should-not-get-here crashes in x86 instruction selection. In our current e2e IREE use case, the Undefined Behavior, while not crashing or affecting correctness, still causes us to miss the intended VNNI instruction.
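A minimal reduction in that spirit (a hypothetical sketch, not the actual reduced IR from the dump) is just the intrinsic called from a function whose attributes don't enable the feature:

```llvm
; Hypothetical reduction: the AVX-512-VNNI intrinsic is called from a function
; whose attributes enable neither +avx512vnni nor any other AVX-512 feature.
declare <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32>, <16 x i32>, <16 x i32>)

define <16 x i32> @missing_vnni(<16 x i32> %acc, <16 x i32> %a, <16 x i32> %b) #0 {
  %r = tail call <16 x i32> @llvm.x86.avx512.vpdpwssd.512(<16 x i32> %acc, <16 x i32> %a, <16 x i32> %b)
  ret <16 x i32> %r
}

; Note: no "target-features"="+avx512vnni" here.
attributes #0 = { nounwind }
```

Feeding something like this to `llc -mtriple=x86_64-unknown-linux-gnu` can abort in instruction selection, since the backend has no legal way to lower the intrinsic without the feature enabled.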
"Of course" this dispatch function doesn't have the required +avx512vnni
CPU feature attribute, since we never put it there. The only functions that have the +avx512vnni
CPU feature attribute are the ukernel internal VNNI implementation functions, which are compiled with this CPU feature enabled in the first place.
I guess I was expecting the attribute to be propagated from callee to caller as the VNNI inner-tile function gets inlined, first into `iree_uk_mmt4d` and then into the dispatch function. It's not.
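That matches LLVM's semantics: the inliner never merges a callee's `"target-features"` into the caller; the caller keeps whatever attributes it already had. A reduced sketch (hypothetical names) of the situation we end up in:

```llvm
; Hypothetical sketch: once @tile_avx512vnni is inlined into @dispatch, the
; VNNI code lives inside @dispatch, but @dispatch still carries attribute
; group #0, which lacks the feature.
define internal void @dispatch() #0 {
  call void @tile_avx512vnni()   ; gets inlined; body uses @llvm.x86.avx512.vpdpwssd.512
  ret void
}

define internal void @tile_avx512vnni() #1 {
  ret void
}

attributes #0 = { nounwind }
attributes #1 = { nounwind "target-features"="+avx512f,+avx512vnni" }
```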
How do we resolve that in a way that doesn't violate the target-specialization design in `LLVMCPUTarget`? @benvanik