perf: Add FP16 GEMM MMUL Reshaped Only Rhs Support #1181
Conversation
Looks fine to me
Force-pushed from 9181c67 to 25eea17
src/gpu/cl/kernels/ClGemmMatrixMultiplyReshapedOnlyRhsMMULKernel.cpp (outdated; resolved)
const bool is_fp16 = (src0->data_type() == DataType::F16);

// These error messages are for FP16 acc.
ARM_COMPUTE_RETURN_ERROR_ON_MSG(is_fp16 && (n < rhs_info.n0 * mmul_n0), "N must be greater than N0 * MMUL_N0");
First suggestion: put the fp16-related validations into an if(is_fp16) block and remove is_fp16 from every check.
Also, the messages should be specific to the fp16 kernel, e.g. "K must be a multiple of 4 in the fp16 mmul kernel".
How about this? (same change as for the comment before it)
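For illustration, the restructuring being suggested might look roughly like the sketch below; the K check is reconstructed from the example message in the comment above, not taken from the actual diff:

// Sketch only: group the fp16-accumulator checks behind one branch instead of
// prefixing every macro with `is_fp16 &&`.
if(is_fp16)
{
    ARM_COMPUTE_RETURN_ERROR_ON_MSG(n < rhs_info.n0 * mmul_n0, "N must be greater than N0 * MMUL_N0 in the fp16 mmul kernel");
    // Hypothetical check reconstructed from the reviewer's example message
    ARM_COMPUTE_RETURN_ERROR_ON_MSG((k % 4) != 0, "K must be a multiple of 4 in the fp16 mmul kernel");
}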
const unsigned int m = gemm_info.m;
const unsigned int n = gemm_info.n;
const unsigned int k = gemm_info.k;
const bool is_fp16 = (src0->data_type() == DataType::F16);
We check all these validations for fp16, but at this point we do not know whether we'll be using the fp16 mmul kernel. So we might be validating the fp32 mmul kernel running on fp16 input against all these constraints, which is wrong, right?
How about this?
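Loosely sketched, the concern could be addressed by gating the checks on the kernel variant that will actually run rather than on the input type; fp16_acc_kernel_selected below is a made-up placeholder for whatever the heuristic decides:

// Sketch only: validate against fp16-accumulator constraints only when the
// fp16 mmul kernel would actually be chosen, not merely when the input is F16.
const bool fp16_acc_kernel_selected = is_fp16 /* && heuristic picks fp16 accumulation */;
if(fp16_acc_kernel_selected)
{
    // ... fp16-accumulator-specific validations go here ...
}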
src/gpu/cl/kernels/ClGemmMatrixMultiplyReshapedOnlyRhsMMULKernel.cpp (outdated; resolved)
src/gpu/cl/kernels/ClGemmMatrixMultiplyReshapedOnlyRhsMMULKernel.cpp (outdated; resolved)
@@ -132,15 +139,15 @@ FIXTURE_DATA_TEST_CASE(RunSmall, CLGEMMMatrixMultiplyReshapedOnlyRhsMMULFixture<

 TEST_SUITE_END() // FP32

-TEST_SUITE(FP16)
+TEST_SUITE(MMUL_FP16)
The fp32 mmul kernel still supports fp16 although the heuristics don't choose it (same goes for block sizes), therefore we shouldn't remove this test.
Also, we should replace the combine(combine(... patterns with a single combine(...) wherever we touch.
How about this?
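Concretely, the combine clean-up reads something like this (the dataset names are illustrative, and this assumes the test framework's combine() accepts any number of datasets, as the comment above implies):

// Before: nested two-argument combines
combine(combine(combine(m_values, n_values), k_values), b_values)
// After: a single variadic combine
combine(m_values, n_values, k_values, b_values)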
 /** N values to test */
 const auto n_values = framework::dataset::make("N", {257});
+const auto n_values_fp16 = framework::dataset::make("N", {79});
I think one of the mistakes we made was to test on a single shape. While we test different block sizes on a single shape, we should also test on small shapes with a subset of block sizes (while keeping in mind the test time required, of course).
How about this? These are values from a number of models of interest. Given the combinatorial expansion, I think 3 is a reasonable number of values for each of the datasets. I've tried to pick shapes on the smaller side to keep the runtime reasonable too.
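As a sketch of what such a dataset might look like (79 comes from the earlier diff; the other two values are placeholders, not the actual model-derived numbers):

// Hypothetical example: three small N values instead of a single one
const auto n_values_fp16 = framework::dataset::make("N", {79, 257, 384});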
 /** K0 values to test - Precommit */
 const auto k0_values_precommit = framework::dataset::make("K0", { 1 });

 /** Broadcast bias from vector to matrix */
-const auto broadcast_bias_values = framework::dataset::make("broadcast_bias", { false, true } );
+const auto broadcast_bias_values = framework::dataset::make("broadcast_bias", { false } );
This change reduces fp32 tests as well
@@ -160,7 +167,38 @@ FIXTURE_DATA_TEST_CASE(RunSmall, CLGEMMMatrixMultiplyReshapedOnlyRhsMMULFixture<
         framework::ARM_COMPUTE_PRINT_INFO();
     }
 }
 TEST_SUITE_END() // FP16

 TEST_SUITE(ExportToCLImage)
If there are going to be any, they should be under the TEST_SUITE(ExportToCLImage) below.
How's this?
act_values))
{
    // Validate output
    if(validate_result)
I think we did this the wrong way 4 years ago, and the culprit is me :) We shouldn't make the same mistake here. If you look at GEMMFixture.h and the relevant fixture class, this is set to false when the gemm or the reshape does not validate to true. So, if we provide a faulty configuration to test, it'll skip the test. This is definitely not the right thing to do.
I hope the context is clear. Let me know if it's not. Now, what do we need to do?
We change GEMMMatrixMultiplyReshapedOnlyRhsMMULValidationFixture so that validate_result is set to false only if the hardware features relevant to the test in question are not supported. I.e. we have fp32 and fp16 tests using mmul with fp32 accumulators, and we have fp16 tests using mmul with fp16 accumulators.
I've made a change in GEMMFixture.h to how this value is set: it now corresponds to whether the target hardware supports arm_matrix_multiply.
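In sketch form, and assuming the library exposes a capability query along the lines of arm_matrix_multiply_supported() (the exact helper and call site in GEMMFixture.h may differ):

// Sketch only: tie validate_result to a hardware capability check instead of
// to whether configure()/validate() happened to succeed for the given config.
validate_result = arm_matrix_multiply_supported(CLKernelLibrary::get().get_device());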
Force-pushed from 25eea17 to 5f4e692
This patch introduces a GEMM routine that is optimized for Arm(R) Mali(TM)-G1

Resolves: [COMPMID-8311], [COMPMID-8312]
Signed-off-by: Omar Al Khatib <[email protected]>
Change-Id: I84e685f0314da9af1c3fbb50d83e68b355727770
Force-pushed from 5f4e692 to 5e9d919
This patch introduces a GEMM routine that is optimized for Arm(R) Mali(TM)-G1
Resolves: [COMPMID-8311], [COMPMID-8312]
Change-Id: I84e685f0314da9af1c3fbb50d83e68b355727770