ggml-cuda : perform cublas mat mul of quantized types as f16 #3412
Conversation
Great. Can’t test atm, but if ppl looks ok we should merge
Perplexity looks good.
Will this murder P40? Also, what if I am running a model on 3090s and also P40s together?
This is only used on Volta and up.
Right, but what happens if one GPU is Pascal and one GPU is Ampere? Will it go with the lowest CUDA version for all?
This was already only used on the main GPU, but I think that even that may not work properly when converting
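For context, here is a minimal sketch (not the actual ggml-cuda code) of how a per-device compute capability gate might look; the helper name is invented, and CC_VOLTA (700) with the 100*major + 10*minor encoding is an assumption based on the constant this PR renames from CC_TURING:

```cpp
// Hypothetical sketch, not the actual ggml-cuda source: decide per device
// whether the fp16 cuBLAS path may be used. The helper name is invented;
// CC_VOLTA (700) corresponds to the constant this PR renames from CC_TURING.
#include <cuda_runtime.h>

#define CC_VOLTA 700

static bool device_can_use_fp16_cublas(int device_id) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_id);
    const int cc = 100 * prop.major + 10 * prop.minor;
    // Volta (sm_70) and newer take the fp16 tensor core path;
    // Pascal cards such as the P40 stay on the existing fp32 path.
    return cc >= CC_VOLTA;
}
```

With a mixed Pascal/Ampere setup this check comes out differently per device, which is presumably why the fp16 mat mul is disabled completely with multi GPU (see the commit notes below).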
I updated the A100 numbers using this PR: #3359
This increases VRAM usage for some reason. With this build and using --nommap my Q4_K_S model no longer fits in VRAM and it slows down dramatically.
Edit: Apologies, I misread. I was confusing the new mul mat kernels (MMQ) with MMAP, so the higher VRAM usage is expected. The difference MMQ makes is dramatic in my case.
Hopefully this change can be made to work with mmq as well.
Once support for tensor cores is added to mmq, it will be as fast or faster than cublas again, while still using less VRAM. For now, cublas is the only way to use tensor cores.
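For reference, a rough sketch of the kind of fp16 GEMM call through cuBLAS that engages the tensor cores on Volta and newer; the wrapper name, sizes, and leading dimensions are placeholders, not the exact call ggml-cuda makes:

```cpp
// Illustrative only: an fp16 x fp16 -> fp16 GEMM through cublasGemmEx.
// On Volta and newer this runs on the tensor cores.
#include <cublas_v2.h>
#include <cuda_fp16.h>

static void gemm_f16(cublasHandle_t handle,
                     const half * A, const half * B, half * C,
                     int m, int n, int k) {
    const half alpha = 1.0f;
    const half beta  = 0.0f;
    // Column-major C = alpha * A * B + beta * C, with fp16 inputs, output
    // and compute type, so no fp32 round trip is involved.
    // CUBLAS_COMPUTE_16F requires cuBLAS 11+.
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Quantized weights first have to be dequantized into an fp16 buffer before such a call, which is what this PR adds and also why it uses more VRAM than mmq.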
Just ran Nsight and can confirm the tensor cores are, for the first time ever, used to their full extent. Awesome work! Now fingers crossed it's easy to enable tensor core support for mmq as well. If mmq (which was a lot faster than cublas before this commit) can benefit from TC support as well, then we are definitely in for another revolution here. Exciting stuff!
Would be interesting to see how the changes affect AMD users.
…example

* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401)
  train : fix KQ_pos allocation (ggml-org#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206)
  readme : update hot topics + model links (ggml-org#3399)
  readme : add link to grammars app (ggml-org#3388)
  swift : fix build on xcode 15 (ggml-org#3387)
  build : enable more non-default compiler warnings (ggml-org#3200)
  ggml_tensor: update the structure comments. (ggml-org#3283)
  ggml : release the requested thread pool resource (ggml-org#3292)
  llama.cpp : split llama_context_params into model and context params (ggml-org#3301)
  ci : multithreaded builds (ggml-org#3311)
  train : finetune LORA (ggml-org#2632)
  gguf : basic type checking in gguf_get_* (ggml-org#3346)
  gguf : make token scores and types optional (ggml-org#3347)
  ci : disable freeBSD builds due to lack of VMs (ggml-org#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228)
  docs : mark code as Bash (ggml-org#3375)
  readme : add Mistral AI release 0.1 (ggml-org#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
~2 weeks ago I did a prototype implementation for mmq using tensor cores and was not able to get better performance. From what I can tell, a prerequisite to getting good tensor core utilization would be to load data asynchronously. As of right now the mmq compute pipeline utilization (without tensor cores) is only ~50%.
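For anyone wondering what "load data asynchronously" could look like in practice, below is a rough, hypothetical double-buffering sketch built on the cp.async pipeline primitives (hardware-accelerated on Ampere and newer); the tile size, data layout, and kernel name are invented and this is not the mmq prototype mentioned above:

```cpp
// Hypothetical sketch of double-buffered asynchronous loading into shared
// memory: copy the next tile with cp.async while the current tile is being
// consumed. TILE_BYTES, the layout and the kernel name are illustrative.
#include <cuda_pipeline.h>
#include <cstdint>

#define TILE_BYTES 4096  // bytes of quantized data staged per tile (assumption)

__global__ void tiled_consumer(const uint8_t * __restrict__ qdata, int n_tiles) {
    __shared__ __align__(16) uint8_t tile[2][TILE_BYTES];

    if (n_tiles <= 0) {
        return;
    }

    int buf = 0;
    // Prefetch the first tile before the main loop.
    for (int i = threadIdx.x * 16; i < TILE_BYTES; i += blockDim.x * 16) {
        __pipeline_memcpy_async(&tile[buf][i], &qdata[i], 16);
    }
    __pipeline_commit();

    for (int t = 0; t < n_tiles; ++t) {
        const int next = buf ^ 1;
        if (t + 1 < n_tiles) {
            // Start copying the next tile; this overlaps with the compute below.
            const uint8_t * src = qdata + (size_t)(t + 1) * TILE_BYTES;
            for (int i = threadIdx.x * 16; i < TILE_BYTES; i += blockDim.x * 16) {
                __pipeline_memcpy_async(&tile[next][i], &src[i], 16);
            }
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);  // current tile (committed one batch earlier) is ready
        __syncthreads();

        // ... dequantize tile[buf] and run the (tensor core) math on it here ...

        __syncthreads();  // everyone is done with tile[buf] before it gets overwritten
        buf = next;
    }
}
```

The point is that the global memory traffic for tile t+1 overlaps with the dequantize/compute work on tile t, which is what would keep the tensor cores from starving.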
ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)

* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16
* rename CC_TURING to CC_VOLTA
* disable fp16 mat mul completely with multi GPU
Improves prompt processing speed with quantized types, but only with mmq disabled (-nommq). Essentially this is the same as #3370, extended to quantized types by dequantizing to fp16.
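As a rough illustration of the dequantize-to-fp16 idea (using a simplified, made-up block format, not the actual ggml quantization layouts or the PR's exact kernels): dequantize directly into a half buffer that can then be fed to the fp16 cuBLAS GEMM:

```cpp
// Illustration only: a simplified 8-bit block (one fp16 scale + 32 int8
// weights) dequantized directly to fp16. The struct and kernel names are
// made up for this sketch; the real ggml layouts (Q4_K, etc.) differ.
#include <cuda_fp16.h>
#include <cstdint>

struct block_q8_simple {
    half   d;       // per-block scale
    int8_t qs[32];  // quantized weights
};

__global__ void dequantize_block_to_f16(const block_q8_simple * __restrict__ x,
                                        half * __restrict__ y, const int64_t n) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    const int64_t ib = i / 32;         // which block
    const int     iq = (int)(i % 32);  // position within the block
    // Write fp16 directly so the output buffer can go straight into the
    // fp16 cuBLAS GEMM without an intermediate fp32 conversion.
    y[i] = __float2half(__half2float(x[ib].d) * (float) x[ib].qs[iq]);
}
```

Skipping the intermediate fp32 buffer is what lets the cuBLAS path for quantized types behave like the pure fp16 path from #3370.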
For comparison, this is the performance that I get with mmq enabled (the default):