ARM: Fixes and additions to CPU feature detection #14049


Closed
wants to merge 3 commits into from

Conversation

ckastner
Collaborator

@ckastner ckastner commented Jun 6, 2025

Working with the ggml-cpu ARM backend, I noticed that feature detection was incomplete.

This improves detection for FP16_VECTOR_ARITHMETIC, and adds support for SVE2.

Note that I had no way to test the __APPLE__ implementation for querying FP16_VECTOR_ARITHMETIC; I just used the sysctl name that I found in a web search.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 6, 2025
@@ -3449,6 +3469,14 @@ int ggml_cpu_has_dotprod(void) {
#endif
}

int ggml_cpu_has_fp16_va(void) {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
Collaborator

I wonder why we do a second ifdef here when this variable is set to 0 or 1 elsewhere.

Collaborator Author

See my comment below, case 1. The way I see it, during runtime detection the host may report that the CPU supports this feature, but if we disabled it at compilation, we want the function to always return 0.

@ericcurtin ericcurtin requested a review from Copilot June 7, 2025 14:43
@Copilot Copilot AI left a comment

Pull Request Overview

Adds support for ARM SVE2 feature detection and refines FP16 vector arithmetic checks in the ggml CPU backend.

  • Introduce has_sve2 flag and expose ggml_cpu_has_sve2() in the public API
  • Enhance runtime (sysctl) and compile-time detection for FP16_VECTOR_ARITHMETIC
  • Insert SVE2 into the backend feature list

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ggml/src/ggml-cpu/ggml-cpu.cpp Pushes “SVE2” into the feature vector
ggml/src/ggml-cpu/ggml-cpu.c Adds has_fp16_va and has_sve2 fields; updates runtime/sysctl & compile-time checks
ggml/include/ggml-cpu.h Declares ggml_cpu_has_sve2()
Comments suppressed due to low confidence (3)

ggml/src/ggml-cpu/ggml-cpu.cpp:562

  • You’ve added SVE2 detection here but no tests were introduced to verify this flag under different CPU configurations. Consider adding unit tests or CI checks to cover both presence and absence of SVE2.
if (ggml_cpu_has_sve2()) {

ggml/src/ggml-cpu/ggml-cpu.cpp:559

  • The PR description mentions improved FP16_VECTOR_ARITHMETIC detection, but this feature isn't added to the features vector. Add a corresponding features.push_back({ "FP16_VECTOR_ARITHMETIC", "1" }); entry when ggml_cpu_has_fp16_va() returns true.
static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r

ggml/src/ggml-cpu/ggml-cpu.cpp:562

  • [nitpick] Indentation here doesn't match the surrounding if statements (extra spaces). Align this block to the existing code style for consistency.
        if (ggml_cpu_has_sve2()) {

@@ -689,8 +691,10 @@ static void ggml_init_arm_arch_features(void) {

ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FPHP);
Copilot AI Jun 7, 2025

The macro HWCAP_FPHP looks like a typo; the standard HWCAP for FP16 support is usually HWCAP_FP16. Verify and correct this macro to ensure proper runtime detection.

Suggested change
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FPHP);
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FP16);


ggml_arm_arch_features.has_dotprod = 0;
#endif

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
Copilot AI Jun 7, 2025

The compile-time block for has_fp16_va may override the sysctl result unconditionally (even on Apple). Wrap these macros in an #else of the Apple-specific sysctl branch to avoid conflicting assignments.


Member

@slaren slaren left a comment

There is no reason to check for a feature that is not supported in the code. The current runtime detection in ggml_arm_arch_features is not working and should be removed, or adapted into a system similar to what we use for x86-64.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.

The current runtime detection in ggml_arm_arch_features is not working

Can you expand on that? It works fine on my end, though I've only done basic smoke testing so far.

@slaren
Member

slaren commented Jun 7, 2025

The problem is that using intrinsics requires enabling support for the instruction set in the compiler, and this may cause the compiler to emit these instructions even in code that doesn't use intrinsics, e.g. for auto-vectorization. For this reason we cannot rely on this type of runtime dispatching when using intrinsics.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

Ah, I think by "using" you mean usage in the general sense, e.g. someone deciding to branch on ggml_cpu_has_dotprod() at runtime, right? Whereas I was only thinking of the specific use case of calculating a score for an ALL_VARIANTS backend.

I would then alter this PR to implement something like cpu-feats-x86.cpp but I still think I'm missing something:

The problem is that using intrinsics requires enabling support for the instruction set in the compiler

Isn't this happening right now anyway? During compilation, GGML_NATIVE=ON turns on all possible features supported by the host. So given this:

int ggml_cpu_has_dotprod(void) {                                                                                               
#if defined(__ARM_ARCH) && defined(__ARM_FEATURE_DOTPROD)
    return ggml_arm_arch_features.has_dotprod;
#else
    return 0;
#endif
}

We have three cases:

  1. DOTPROD is not enabled at compilation -> 0
  2. DOTPROD is enabled at compilation and getauxval() reports that the current CPU supports it -> 1
  3. DOTPROD is enabled at compilation but the current CPU does not support it -> 0

Cases 1 and 2 are OK, but in case 3 there is the problem you mention, that "this may cause the compiler to emit these instructions even in code that doesn't use intrinsics", right?

If so, then it seems this could only be a problem when a binary was compiled NATIVE for some host and then transferred to a "lesser" host, which wouldn't make practical sense (that's not native). And right now we only have 1-N backends built specifically for some CPU; there is no single "universal binary".

In the context of GGML_CPU_ALL_VARIANTS, I don't think even 3. would make a difference. On a particular CPU, the scoring function would choose the backend that was configured/built just as GGML_NATIVE=ON would have done on it.

A bit much but again, just to understand what the intention is. I'll be needing to do the same for PowerPC so I'd like to get everything right.

@slaren
Member

slaren commented Jun 7, 2025

Yes exactly, you got everything right. Just to reiterate: some code relies on the feature detection to determine which instruction set to use, for example:

if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {

While this works for inline assembly, it doesn't work for intrinsics and thus the method is flawed, and should be removed. The best way to implement runtime dispatching for Arm would be to implement support for GGML_CPU_ALL_VARIANTS by adding a file similar to cpu-feats-x86.cpp that computes the score for Arm, and defining a list of variants to build. The current feature detection code could be used as a starting point to implement the score function.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

Excellent, thank you for the clarification. You've anticipated my next question: is it OK to start a cpu-feats-aarch64.cpp based on some of the current feature detection code?

I see now the problem in ggml-cpu-aarch64.cpp; I somehow missed this.

@chaxu01
Collaborator

chaxu01 commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

I'm afraid I already finished this yesterday, just filed it as #14080.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

As the proposed alternative implementation does not necessitate fixes to the existing ARM feature detection, I consider this PR obsolete, superseded by #14080. Therefore retracting.

@ckastner ckastner closed this Jun 9, 2025
@zhouwg
Contributor

zhouwg commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

I'm afraid I already finished this yesterday, just filed it as #14080.

I hope your PR can be approved, although I can see there is another potential implementation from a regular employee of ARM.
Best wishes to you.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

I hope your PR can be approved, although I can see there is another potential implementation from a regular employee of ARM. Best wishes to you.

I think the conclusion of the above discussion is that implementing GGML_CPU_ALL_VARIANTS doesn't really touch the ARM code; it's just a question of how the build is produced. This is what #14080 does. Its scoring function only makes syscalls to query support from the OS; the rest is just cmake changes.

The parts that actually touch ARM still need their work (see comment).

@zhouwg
Contributor

zhouwg commented Jun 9, 2025

I see.

ARM and other SoC vendors have many undocumented tech docs/libs, and they know everything about ARM-based SoCs and dedicated chips.

BTW, FYI: https://github.com/nihui/ruapu

@ckastner ckastner deleted the arm-feat-fixes branch June 9, 2025 15:05