
CUDA: faster multi GPU synchronization #2448


Merged

Conversation

JohannesGaessler (Collaborator)

This PR replaces the per-row memcpy loop used for synchronization between GPUs with a single memcpy2D call. For prompt processing this is much faster:

| GPU    | Model   | Test  | t/s 1x P40 | t/s master | t/s PR | Speedup |
|--------|---------|-------|------------|------------|--------|---------|
| 3x P40 | 7b q4_0 | tg128 | 47.41      | 43.45      | 43.53  | 1.00    |
| 3x P40 | 13b q4_0 | tg128 | 26.30     | 29.83      | 29.89  | 1.00    |
| 3x P40 | 33b q4_0 | tg128 | 11.51     | 15.37      | 15.38  | 1.00    |
| 3x P40 | 7b q4_0 | pp    | 454.89     | 122.02     | 346.80 | 2.84    |
| 3x P40 | 13b q4_0 | pp    | 258.17    | 85.05      | 206.76 | 2.43    |
| 3x P40 | 33b q4_0 | pp    | 104.97    | 46.42      | 99.17  | 2.14    |

For small models, multiple fast GPUs still seem to be slower than a single fast GPU due to the synchronization overhead.

JohannesGaessler merged commit 9baf9ef into ggml-org:master on Jul 29, 2023.