ggml-cpu: fixes instability in NNPA Vector Intrinsics #15739
Conversation
Signed-off-by: Aaron Teo <[email protected]>
ggml/src/ggml-cpu/ggml-cpu.c (Outdated)
int ggml_cpu_support_fattn(void) {
#if defined(GGML_NNPA) || defined(__NNPA__)
    // disable Flash Attention when using NNPA
    // see: https://github.com/ggml-org/llama.cpp/issues/15721
    return 0;
#else
    return 1;
#endif
}
This is not going to work. Either the problem with NNPA should be fixed, or NNPA support removed.
Signed-off-by: Aaron Teo <[email protected]>
I'm a little bit at a loss here. I've double-checked the NNPA FP32 to FP16 code and verified that it is actually correct. The thing causing the failure in the FP32->FP16 conversion is actually because […]. That's as far as I have gotten in identifying the failure point, and I think it would be easier to disable the FP32->FP16 NNPA conversion for now, while leaving FP16->FP32 as-is, since it is still correct and functioning.
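(For context, a minimal round-trip check along the lines of the sketch below is one way to compare the two conversion directions. This is not code from the Pull Request; it only uses the public ggml_fp32_to_fp16 / ggml_fp16_to_fp32 helpers from ggml.h, and whether those helpers exercise the NNPA-accelerated conversions depends on how a given build wires them up, so treat it purely as an illustrative sketch. Building it once with -DGGML_NNPA and once without, then diffing the output, is one way to narrow down which direction misbehaves.)

#include <math.h>
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // Representative inputs, including a small value near the FP16 subnormal
    // range and 65504.0f, the largest finite FP16 value.
    const float samples[] = { 0.0f, 1.0f, -2.5f, 3.1415926f, 1e-4f, 65504.0f };
    const int n = (int) (sizeof(samples) / sizeof(samples[0]));

    for (int i = 0; i < n; ++i) {
        ggml_fp16_t h = ggml_fp32_to_fp16(samples[i]); // FP32 -> FP16
        float back    = ggml_fp16_to_fp32(h);          // FP16 -> FP32
        printf("in = % .7e  roundtrip = % .7e  abs err = %.3e\n",
               (double) samples[i], (double) back, (double) fabsf(back - samples[i]));
    }
    return 0;
}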
Superseded by #15821
This Pull Request fixes #14877 and #15721, whereby compiling with -DGGML_NNPA causes gibberish output when running with more than 4 threads. This code change also includes automatically disabling Flash Attention when building with -DGGML_NNPA.
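(As an illustration of the Flash Attention gating, the outdated diff above adds an int ggml_cpu_support_fattn(void) capability query. The sketch below is a hypothetical caller-side example, not code from this Pull Request; it assumes only the declaration shown in that diff and a build of ggml that contains it.)

#include <stdio.h>

// Declaration as shown in the outdated diff above (the PR was later superseded).
int ggml_cpu_support_fattn(void);

int main(void) {
    // On builds configured with -DGGML_NNPA the function above returns 0,
    // so a caller could skip selecting a Flash Attention code path.
    if (ggml_cpu_support_fattn()) {
        printf("Flash Attention is available in this CPU backend build\n");
    } else {
        printf("Flash Attention is disabled in this CPU backend build (e.g. NNPA)\n");
    }
    return 0;
}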
Verification
To ensure that the inference is now correct with -DGGML_NNPA, the following tests have been done:

Performance Results
Note
Tests were conducted on an IBM z16 mainframe with 2 IFLs and 64 GB of memory in a shared z/VM environment.