Conversation

@taronaeo (Collaborator) commented Sep 2, 2025

This Pull Request fixes #14877 and #15721, where compiling with -DGGML_NNPA causes gibberish output with more than 4 threads. This change also includes automatically disabling Flash Attention when building with -DGGML_NNPA. A hypothetical caller-side sketch is included below to show how such a capability flag might be consumed; only the ggml_cpu_support_fattn() check itself comes from this PR (see the diff hunk further down), while the helper name and wiring are illustrative assumptions.
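```c
// Hypothetical illustration only: ggml_cpu_support_fattn() is the capability
// check added by this PR, but where and how callers consult it is an
// assumption made for illustration.
#include <stdbool.h>
#include <stdio.h>

int ggml_cpu_support_fattn(void); // exported by the ggml CPU backend in this PR

// Decide whether Flash Attention should actually be enabled for this build.
static bool resolve_flash_attn(bool user_requested) {
    if (user_requested && !ggml_cpu_support_fattn()) {
        fprintf(stderr, "warning: Flash Attention is unsupported in this build (NNPA); disabling it\n");
        return false;
    }
    return user_requested;
}
```

A caller linking against the CPU backend could then route its user-facing flash-attention flag through resolve_flash_attn() instead of using the raw flag directly.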

Verification

To ensure that the inference is now correct with -DGGML_NNPA, the following tests have been done:

  1. threads=1 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 1 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  2. threads=2 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 2 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  3. threads=4 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 4 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  4. threads=8 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  5. threads=16 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 16 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target

Performance Results

| model | size | params | backend | threads | test | t/s master | t/s PR | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | pp256 | 29.87 | 28.5 | 0.95 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 4 | tg64 | 1.84 | 1.84 | 1.00 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | pp256 | 25.15 | 24.97 | 0.99 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 4 | tg64 | 1.47 | 1.92 | 1.31 |

Note

Tests were conducted on an IBM z16 Mainframe with 2 IFLs and 64 GB Memory on a shared z/VM environment.

@github-actions bot added the documentation (Improvements or additions to documentation) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 2, 2025
Comment on lines 3527 to 3535
int ggml_cpu_support_fattn(void) {
#if defined(GGML_NNPA) || defined(__NNPA__)
// disable Flash Attention when using NNPA
// see: https://github.com/ggml-org/llama.cpp/issues/15721
return 0;
#else
return 1;
#endif
}
A Member commented:

This is not going to work. Either the problem with NNPA should be fixed, or NNPA support removed.

@taronaeo (Collaborator, Author) commented Sep 5, 2025

I'm a little bit at a loss here. I've double-checked the NNPA FP32 to FP16 code and verified that it is actually correct. The failure in the FP32->FP16 conversion happens because ggml_compute_forward_dup_f32 calls ggml_cpu_fp32_to_fp16 with -inf or nan data, for some reason, when Flash Attention is turned on.

That's as far as I have gotten in identifying the failure point, and I think it would be easier to disable the FP32->FP16 NNPA conversion for now, while leaving FP16->FP32 as-is since it is still correct and functioning.

[image: -fa off vs -fa on, showing -inf]

[image: -fa off vs -fa on, showing nan]
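
To make the proposed direction concrete, a minimal sketch is shown below. It is illustrative only, not the actual ggml source: the header location, the scalar conversion macro names, and the exact signatures are assumptions based on the names mentioned in this thread, and the NNPA vector bodies are elided.

```c
#include <stdint.h>
#include "ggml-impl.h" // assumed location of ggml_fp16_t and the scalar
                       // GGML_COMPUTE_FP32_TO_FP16 / FP16_TO_FP32 macros

// FP32 -> FP16: always take the scalar path for now, because
// ggml_compute_forward_dup_f32 can feed this function -inf/nan data when
// Flash Attention is enabled, which the NNPA fast path mishandles.
void ggml_cpu_fp32_to_fp16(const float * x, ggml_fp16_t * y, int64_t n) {
    for (int64_t i = 0; i < n; ++i) {
        y[i] = GGML_COMPUTE_FP32_TO_FP16(x[i]);
    }
}

// FP16 -> FP32: this direction is still correct, so the NNPA-accelerated
// loop would stay guarded behind __NNPA__ (vector body elided in this sketch).
void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n) {
    int64_t i = 0;
#if defined(__NNPA__)
    // ... existing NNPA vectorized conversion, unchanged ...
#endif
    for (; i < n; ++i) {
        y[i] = GGML_COMPUTE_FP16_TO_FP32(x[i]);
    }
}
```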

@taronaeo (Collaborator, Author) commented Sep 5, 2025

Superseded by #15821

@taronaeo closed this Sep 5, 2025