
PowerPC: Enable MMA for BF16 in llamafile_sgemm #13148


Merged
1 commit merged into ggml-org:master from main_bf16_sgemm on May 2, 2025

Conversation

shalinib-ibm (Contributor)

This patch upstreams llamafile's CPU matrix-multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.

This change yields 9x to 40x gains in total speed S t/s (i.e., all tokens / total time) across the various batch sizes tested with the llama-batched-bench benchmark.

The patch was tested with the Meta-Llama-3-8B and Mistral-7B models (BF16 models generated with llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.
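For readers without a POWER10 machine, here is a minimal sketch of the builtin sequence such a kernel relies on. It is not the code added by this PR; the helper name `mma_bf16_tile`, its signature, and the tile layout are illustrative assumptions, while the builtins themselves (`__builtin_mma_xxsetaccz`, `__builtin_mma_xvbf16ger2pp`, `__builtin_mma_disassemble_acc`) are the documented POWER10 MMA API.

```cpp
// Minimal sketch of one BF16 tile update with POWER10 MMA builtins.
// Illustrative only (not the kernel added by this PR).
// Build with: g++ -O2 -mcpu=power10 ...
#include <altivec.h>
#include <cstring>

typedef __vector unsigned char vec_t;  // generic 16-byte VSX vector

// 'a' packs a 4x2 BF16 sub-block of A and 'b' a 2x4 BF16 sub-block of B,
// as required by the xvbf16ger2 rank-2 update; C is a 4x4 fp32 tile.
static void mma_bf16_tile(float C[4][4], vec_t a, vec_t b) {
    __vector_quad acc;                          // 512-bit accumulator (4x4 fp32)
    __builtin_mma_xxsetaccz(&acc);              // zero the accumulator
    __builtin_mma_xvbf16ger2pp(&acc, a, b);     // acc += a (4x2) * b (2x4)

    vec_t rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);  // spill acc into 4 vectors
    for (int i = 0; i < 4; ++i) {
        float r[4];
        std::memcpy(r, &rows[i], sizeof(r));    // one row of the 4x4 tile
        for (int j = 0; j < 4; ++j) {
            C[i][j] += r[j];
        }
    }
}
```

A real kernel would tile over larger blocks and stream many such rank-2 updates into the accumulator before disassembling it; the sketch only shows how a single BF16 outer-product update lands in a 4x4 float tile.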


github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 28, 2025
shalinib-ibm (Contributor, Author)

@ggerganov Can you please review this PR and provide your comments?

ggerganov (Member) left a comment


I don't have a machine to test this, but at least fix the indentation of the code and we can merge it.

shalinib-ibm force-pushed the main_bf16_sgemm branch 3 times, most recently from c6c14fa to b9c6af2, on May 2, 2025 at 06:20

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
shalinib-ibm (Contributor, Author)

> I don't have a machine to test this, but at least fix the indentation of the code and we can merge it.

Thank you @ggerganov. I have fixed the code indentation. Can you please review?

shalinib-ibm requested a review from ggerganov on May 2, 2025
ggerganov merged commit 3f3769b into ggml-org:master on May 2, 2025
51 checks passed