Building ExecuTorch on RPi5 with Clang 14.0.6 fails due to bfloat incompatibility

### 🐛 Describe the bug

As discussed with @kirklandsign in [Issue  #8508](https://github.com/pytorch/executorch/issues/8508), I am opening a separate one here.

I was trying to build executorch locally on my RPi5. It worked fine using the Clang compiler (version 14.0.6) and the release/0.4 branch. Now, with the release/0.5 and main branch, I am running into the error below. I guess it is related to the Clang compiler because when I switch to g++/gcc building executorch works just fine.

```

[ 56%] Building C object backends/xnnpack/third-party/XNNPACK/CMakeFiles/microkernels-prod.dir/src/qd8-f16-qc8w-igemm/gen/qd8-f16-qc8w-igemm-1x8c2s4-minmax-neonfp16arith-mlal.c.o
  [ 56%] Building CXX object kernels/portable/CMakeFiles/portable_kernels.dir/cpu/op_addmm.cpp.o
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:80:29: error: unknown type name 'bfloat16x8_t'; did you mean 'float16x8_t'?
  f32_dot_bf16(float32x4_t a, bfloat16x8_t b, bfloat16x8_t c) {
                              ^~~~~~~~~~~~
                              float16x8_t
  /usr/lib/llvm-14/lib/clang/14.0.6/include/arm_neon.h:75:56: note: 'float16x8_t' declared here
  typedef __attribute__((neon_vector_type(8))) float16_t float16x8_t;
                                                         ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:80:45: error: unknown type name 'bfloat16x8_t'; did you mean 'float16x8_t'?
  f32_dot_bf16(float32x4_t a, bfloat16x8_t b, bfloat16x8_t c) {
                                              ^~~~~~~~~~~~
                                              float16x8_t
  /usr/lib/llvm-14/lib/clang/14.0.6/include/arm_neon.h:75:56: note: 'float16x8_t' declared here
  typedef __attribute__((neon_vector_type(8))) float16_t float16x8_t;
                                                         ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:79:1: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
  ET_TARGET_ARM_BF16_ATTRIBUTE static ET_INLINE float32x4_t
  ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:81:10: error: use of undeclared identifier 'vbfdotq_f32'
    return vbfdotq_f32(a, b, c);
           ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:84:1: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
  ET_TARGET_ARM_BF16_ATTRIBUTE static ET_INLINE void
  ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:90:9: error: unknown type name 'bfloat16x8_t'; did you mean 'float16x8_t'?
    const bfloat16x8_t temp_vec1 = vld1q_bf16(reinterpret_cast<const __bf16*>(
          ^~~~~~~~~~~~
          float16x8_t
  /usr/lib/llvm-14/lib/clang/14.0.6/include/arm_neon.h:75:56: note: 'float16x8_t' declared here
  typedef __attribute__((neon_vector_type(8))) float16_t float16x8_t;
                                                         ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:90:68: error: __bf16 is not supported on this target
    const bfloat16x8_t temp_vec1 = vld1q_bf16(reinterpret_cast<const __bf16*>(
                                                                     ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:90:34: error: use of undeclared identifier 'vld1q_bf16'
    const bfloat16x8_t temp_vec1 = vld1q_bf16(reinterpret_cast<const __bf16*>(
                                   ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:92:9: error: unknown type name 'bfloat16x8_t'; did you mean 'float16x8_t'?
    const bfloat16x8_t temp_vec2 = vld1q_bf16(reinterpret_cast<const __bf16*>(
          ^~~~~~~~~~~~
          float16x8_t
  /usr/lib/llvm-14/lib/clang/14.0.6/include/arm_neon.h:75:56: note: 'float16x8_t' declared here
  typedef __attribute__((neon_vector_type(8))) float16_t float16x8_t;
                                                         ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:92:68: error: __bf16 is not supported on this target
    const bfloat16x8_t temp_vec2 = vld1q_bf16(reinterpret_cast<const __bf16*>(
                                                                     ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:92:34: error: use of undeclared identifier 'vld1q_bf16'
    const bfloat16x8_t temp_vec2 = vld1q_bf16(reinterpret_cast<const __bf16*>(
                                   ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:119:1: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
  ET_TARGET_ARM_BF16_ATTRIBUTE static ET_INLINE void
  ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:150:3: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
    ET_TARGET_ARM_BF16_ATTRIBUTE ET_INLINE void operator()(const Func& f) const {
    ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:159:3: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
    ET_TARGET_ARM_BF16_ATTRIBUTE ET_INLINE void operator()(const Func& f) const {
    ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:167:1: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
  ET_TARGET_ARM_BF16_ATTRIBUTE float
  ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:176:33: warning: unknown architecture 'armv8.2-a+bf16' in the 'target' attribute string; 'target' attribute ignored [-Wignored-attributes]
              ET_INLINE_ATTRIBUTE ET_TARGET_ARM_BF16_ATTRIBUTE {
                                  ^
  /home/executorch_05/executorch/kernels/optimized/blas/BlasKernel.cpp:78:25: note: expanded from macro 'ET_TARGET_ARM_BF16_ATTRIBUTE'
    __attribute__((target("arch=armv8.2-a+bf16")))
                          ^
  7 warnings and 9 errors generated.
  gmake[3]: *** [kernels/optimized/CMakeFiles/cpublas.dir/build.make:121: kernels/optimized/CMakeFiles/cpublas.dir/blas/BlasKernel.cpp.o] Fehler 1
  gmake[2]: *** [CMakeFiles/Makefile2:1238: kernels/optimized/CMakeFiles/cpublas.dir/all] Fehler 2
  gmake[2]: *** Es wird auf noch nicht beendete Prozesse gewartet....
  [ 56%] Building CXX object kernels/portable/CMakeFiles/portable_kernels.dir/cpu/op_alias_copy.cpp.

```


### Versions

Collecting environment information...
PyTorch version: 2.7.0.dev20250131+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (aarch64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: 14.0.6
CMake version: version 3.31.6
Libc version: glibc-2.36

Python version: 3.10.0 (default, Mar 3 2022, 09:51:40) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-6.6.74+rpt-rpi-v8-aarch64-with-glibc2.36
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A76
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r4p1
CPU(s) scaling MHz: 100%
CPU max MHz: 2400,0000
CPU min MHz: 1500,0000
BogoMIPS: 108,00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
L1d cache: 256 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 2 MiB (4 instances)
L3 cache: 2 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] executorch==0.6.0a0+542480c
[pip3] numpy==2.2.3
[pip3] torch==2.7.0.dev20250131+cpu
[pip3] torchao==0.10.0+git7d879462
[pip3] torchaudio==2.6.0.dev20250131
[pip3] torchgen==0.0.1
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250131
[conda] executorch 0.6.0a0+542480c pypi_0 pypi
[conda] numpy 2.2.3 pypi_0 pypi
[conda] torch 2.7.0.dev20250131+cpu pypi_0 pypi
[conda] torchao 0.10.0+git7d879462 pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250131 pypi_0 pypi
[conda] torchgen 0.0.1 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250131 pypi_0 pypi

cc @larryliu0820 @lucylq

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Building ExecuTorch on RPi5 with Clang 14.0.6 fails due to bfloat incompatibility #8924

🐛 Describe the bug

Versions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Building ExecuTorch on RPi5 with Clang 14.0.6 fails due to bfloat incompatibility #8924

Description

🐛 Describe the bug

Versions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions