ARM: Fixes and additions to CPU feature detection #14049


Closed
wants to merge 3 commits into from

Conversation

ckastner
Collaborator

@ckastner ckastner commented Jun 6, 2025

Working with the ggml-cpu ARM backend, I noticed that feature detection was incomplete.

This improves detection for FP16_VECTOR_ARITHMETIC, and adds support for SVE2.

Note that I had no way to test the __APPLE__ implementation for querying FP16_VECTOR_ARITHMETIC; I just used the sysctl name that I found in a web search.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 6, 2025
@@ -3449,6 +3469,14 @@ int ggml_cpu_has_dotprod(void) {
#endif
}

int ggml_cpu_has_fp16_va(void) {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
Collaborator

I wonder why we do a second ifdef here when this variable is set to 0 or 1 elsewhere.

Collaborator Author

See my comment below, case 1. The way I see it, during runtime detection the host may report that the CPU supports this feature, but if we disabled it at compilation, we want the function to always return 0.

@ericcurtin ericcurtin requested a review from Copilot June 7, 2025 14:43
@Copilot Copilot AI left a comment

Pull Request Overview

Adds support for ARM SVE2 feature detection and refines FP16 vector arithmetic checks in the ggml CPU backend.

  • Introduce has_sve2 flag and expose ggml_cpu_has_sve2() in the public API
  • Enhance runtime (sysctl) and compile-time detection for FP16_VECTOR_ARITHMETIC
  • Insert SVE2 into the backend feature list

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
ggml/src/ggml-cpu/ggml-cpu.cpp Pushes “SVE2” into the feature vector
ggml/src/ggml-cpu/ggml-cpu.c Adds has_fp16_va and has_sve2 fields; updates runtime/sysctl & compile-time checks
ggml/include/ggml-cpu.h Declares ggml_cpu_has_sve2()
Comments suppressed due to low confidence (3)

ggml/src/ggml-cpu/ggml-cpu.cpp:562

  • You’ve added SVE2 detection here but no tests were introduced to verify this flag under different CPU configurations. Consider adding unit tests or CI checks to cover both presence and absence of SVE2.
if (ggml_cpu_has_sve2()) {

ggml/src/ggml-cpu/ggml-cpu.cpp:559

  • The PR description mentions improved FP16_VECTOR_ARITHMETIC detection, but this feature isn't added to the features vector. Add a corresponding features.push_back({ "FP16_VECTOR_ARITHMETIC", "1" }); entry when ggml_cpu_has_fp16_va() returns true.
static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r

ggml/src/ggml-cpu/ggml-cpu.cpp:562

  • [nitpick] Indentation here doesn't match the surrounding if statements (extra spaces). Align this block to the existing code style for consistency.
        if (ggml_cpu_has_sve2()) {

@@ -689,8 +691,10 @@ static void ggml_init_arm_arch_features(void) {

ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FPHP);
Copilot AI Jun 7, 2025

The macro HWCAP_FPHP looks like a typo; the standard HWCAP for FP16 support is usually HWCAP_FP16. Verify and correct this macro to ensure proper runtime detection.

Suggested change
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FPHP);
ggml_arm_arch_features.has_fp16_va = !!(hwcap & HWCAP_FP16);


ggml_arm_arch_features.has_dotprod = 0;
#endif

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
Copilot AI Jun 7, 2025

The compile-time block for has_fp16_va may override the sysctl result unconditionally (even on Apple). Wrap these macros in an #else of the Apple-specific sysctl branch to avoid conflicting assignments.


Member

@slaren slaren left a comment

There is no reason to check for a feature that is not supported in the code. The current runtime detection in ggml_arm_arch_features is not working and should be removed, or adapted into a system similar to what we use for x86-64.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.

The current runtime detection in ggml_arm_arch_features is not working

Can you expand on that? It works fine on my end, though I've only done basic smoke testing so far.

@slaren
Member

slaren commented Jun 7, 2025

The problem is that using intrinsics requires enabling support for the instruction set in the compiler, and this may cause the compiler to emit these instructions even in code that doesn't use intrinsics, e.g. for auto-vectorization. For this reason we cannot rely on this type of runtime dispatching when using intrinsics.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

Ah, I think by "using" you mean usage in the general sense, e.g. someone deciding to branch on ggml_cpu_has_dotprod() at runtime, right? Whereas I was only thinking of the specific use case of calculating a score for an ALL_VARIANTS backend.

I would then alter this PR to implement something like cpu-feats-x86.cpp but I still think I'm missing something:

The problem is that using intrinsics requires enabling support for the instruction set in the compiler

Isn't this happening right now anyway? During compilation, GGML_NATIVE=ON turns on all possible features supported by the host. So given this:

int ggml_cpu_has_dotprod(void) {                                                                                               
#if defined(__ARM_ARCH) && defined(__ARM_FEATURE_DOTPROD)
    return ggml_arm_arch_features.has_dotprod;
#else
    return 0;
#endif
}

We have three cases:

  1. DOTPROD is not enabled at compilation -> 0
  2. DOTPROD is enabled at compilation and getauxval() reports that the current CPU supports it -> 1
  3. DOTPROD is enabled at compilation but the current CPU does not support it -> 0

Cases 1 and 2 are OK, but in case 3 there is the problem you mention, that "this may cause the compiler to emit these instructions even in code that doesn't use intrinsics", right?

If so, then it seems this could only be a problem when a binary was compiled NATIVE for some host and then transferred to a "lesser" host, which wouldn't make practical sense (that's not native). And right now we only have 1-N backends built specifically for some CPU; there is no single "universal binary".

In the context of GGML_CPU_ALL_VARIANTS, I don't think even 3. would make a difference. On a particular CPU, the scoring function would choose the backend that was configured/built just as GGML_NATIVE=ON would have done on it.

A bit much but again, just to understand what the intention is. I'll be needing to do the same for PowerPC so I'd like to get everything right.

@slaren
Member

slaren commented Jun 7, 2025

Yes exactly, you got everything right. Just to reiterate: some code relies on the feature detection to determine which instruction set to use, for example:

if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {

While this works for inline assembly, it doesn't work for intrinsics and thus the method is flawed, and should be removed. The best way to implement runtime dispatching for Arm would be to implement support for GGML_CPU_ALL_VARIANTS by adding a file similar to cpu-feats-x86.cpp that computes the score for Arm, and defining a list of variants to build. The current feature detection code could be used as a starting point to implement the score function.

@ckastner
Collaborator Author

ckastner commented Jun 7, 2025

Excellent, thank you for the clarification. You've anticipated my next question: is it OK to start a cpu-feats-aarch64.cpp based on some of the current feature detection code?

I see now the problem in ggml-cpu-aarch64.cpp; I somehow missed this.

@chaxu01
Collaborator

chaxu01 commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

I'm afraid I already finished this yesterday, just filed it as #14080.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

As the proposed alternative implementation does not necessitate fixes to the existing ARM feature detection, I consider this PR obsolete, superseded by #14080. Therefore retracting.

@ckastner ckastner closed this Jun 9, 2025
@zhouwg
Contributor

zhouwg commented Jun 9, 2025

I have another change queued up that implements GGML_CPU_ALL_VARIANTS for ARM, which would be based on this.
@ckastner Interesting to see that you've started working on this. I've also been working on it and will submit a PR sometime this week. I hope we can coordinate to avoid duplicate effort.

I'm afraid I already finished this yesterday, just filed it as #14080.

I hope your PR can be approved, although I can see there is another potential implementation from a regular employee of ARM.
Best wishes to you.

@ckastner
Collaborator Author

ckastner commented Jun 9, 2025

I hope your PR can be approved, although I can see there is another potential implementation from a regular employee of ARM. Best wishes to you.

I think the conclusion of the above discussion is that implementing GGML_CPU_ALL_VARIANTS doesn't really touch the ARM code; it's just a question of how the build is produced. This is what #14080 does. Its scoring function only makes syscalls to query support from the OS; the rest is just cmake changes.

The parts that actually touch ARM still need their work (see comment).

@zhouwg
Contributor

zhouwg commented Jun 9, 2025

I see.

ARM and other SoC vendors have many undocumented tech docs/libs, and they know everything about ARM-based SoCs and dedicated chips.

BTW, FYI: https://github.com/nihui/ruapu

@ckastner ckastner deleted the arm-feat-fixes branch June 9, 2025 15:05