
Conversation

OuadiElfarouki
Contributor

This minor arithmetic change (a % b = a - b*(a/b)), apparently missed by the icpx/icx compiler, brings a considerable performance improvement (5% to 35%) in text generation for type-1 quantized models (e.g. Q4_* and Q5_*) when targeting Nvidia devices, without degrading performance on the other supported quantizations and devices.
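A minimal sketch of the pattern, assuming placeholder names and values (qi, vdr and tid are illustrative here, not the exact expressions from ggml-sycl.cpp):

```cpp
// Illustrative sketch only: qi, vdr and tid stand in for the kinds of
// compile-time constants and work-item indices used in the mmvq wrapper
// kernel; the real code in ggml-sycl.cpp may differ.

constexpr int qi  = 32; // quants per block (placeholder value)
constexpr int vdr = 2;  // values handled per dot-product iteration (placeholder value)

// Before: remainder written with the % operator.
int index_with_modulo(int tid) {
    return vdr * (tid % (qi / vdr));
}

// After: the same remainder spelled out as a - b*(a/b), the form that
// avoided the slow path observed with icpx/icx on Nvidia targets.
int index_with_subtraction(int tid) {
    const int q = tid / (qi / vdr);
    return vdr * (tid - (qi / vdr) * q);
}
```

Both functions compute the same index; the second writes the remainder out explicitly so the compiler's handling of % is taken out of the picture.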

Devices tested for this change:

  • Nvidia: A100-40GB, A4000, RTX 4060
  • Intel: Intel Data Center GPU Max 1100, Intel Arc A770

Compiler: oneAPI 2024.1.0

Thanks for reviewing this @AidanBeltonS @abhilash1910 @airMeng @NeoZhangJianyu

@JohannesGaessler
Collaborator

Context: I have written most of the MMVQ CUDA code which to my understanding is what the SYCL code is adapted from.

I don't understand why this change is faster. qi and vdr are powers of 2 that are both known at compile time, so their ratio is also a power of 2 known at compile time. A number modulo a power of 2 should be converted by the compiler to just a bitwise AND, which should be faster than a subtraction.
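For reference, a small self-contained check of that expectation (the power-of-two constant here is illustrative, not the kernel's actual qi/vdr ratio): for unsigned operands and a divisor that is a compile-time power of two, x % 8 and x & 7 compute the same value, so an optimizing compiler is normally expected to lower the modulo to a single bitwise AND rather than a division/remainder sequence.

```cpp
#include <cassert>

// x % 8 and x & 7 are equivalent for unsigned x, so modulo by a
// compile-time power of two should lower to a single bitwise AND.
unsigned mod_pow2(unsigned x)  { return x % 8u; }
unsigned mask_pow2(unsigned x) { return x & 7u; }

int main() {
    for (unsigned x = 0; x < 64; ++x) {
        assert(mod_pow2(x) == mask_pow2(x));
    }
    return 0;
}
```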

Collaborator

@abhilash1910 left a comment


This looks ok to me.
However, I am also confused as to how subtraction beats modulo, as the latter should be efficient. For Intel hardware, is there a significant boost as on Nvidia?
If this looks promising, then such changes need to be addressed in dpct and icpx.
Also pinging @ggerganov for a look when available.

@mofosyne mofosyne added the performance Speed related topics label May 9, 2024
@OuadiElfarouki
Contributor Author

Context: I have written most of the MMVQ CUDA code which to my understanding is what the SYCL code is adapted from.

I don't understand why this change is faster. qi and vdr are powers of 2 that are both known at compile time, so their ratio is also a power of 2 known at compile time. A number modulo a power of 2 should be converted by the compiler to just a bitwise AND, which should be faster than a subtraction.

@JohannesGaessler Agreed, it should be handled by the compiler; in fact, a similar change to equivalent operations in other "hot spots" didn't result in such a leap in performance. The reason is unclear at the moment, but the numbers might justify the patch.

@OuadiElfarouki
Contributor Author

OuadiElfarouki commented May 9, 2024

This looks ok to me. However, I am also confused as to how subtraction beats modulo, as the latter should be efficient. For Intel hardware, is there a significant boost as on Nvidia? If this looks promising, then such changes need to be addressed in dpct and icpx. Also pinging @ggerganov for a look when available.

@abhilash1910 Yes, it's odd indeed; the performance leap was mainly noticed on Nvidia GPUs (some examples below), with little to no improvement on Intel targets.

Some Llama 2 13B examples ((patchPerf - masterPerf) / masterPerf):

  • Nvidia A4000:

    Q4_K_M    36%
    Q4_K_S    52%
    Q5_K_M    38%
    Q5_K_S    52%

  • Nvidia A100:

    Q4_K_M    62%
    Q4_K_S    70%
    Q5_K_M    57%
    Q5_K_S    43%

@mofosyne mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label May 9, 2024
@NeoZhangJianyu
Collaborator

NeoZhangJianyu commented May 10, 2024

It's working on Intel iGPU. About 3% increase.
It's good!

@NeoZhangJianyu NeoZhangJianyu merged commit 8c570c9 into ggml-org:master May 10, 2024
joeatodd added a commit that referenced this pull request Jun 17, 2024
joeatodd added a commit that referenced this pull request Jun 18, 2024

Revert "Minor arithmetic improvement to mmvq wrapper kernel (#7172)" (#7980)

* Revert "Minor arithmetic improvement to mmvq wrapper kernel (#7172)"

This reverts commit 8c570c9.

* Update ggml-sycl.cpp

Alcpz pushed a commit to Alcpz/llama.cpp that referenced this pull request Jun 20, 2024

Revert "Minor arithmetic improvement to mmvq wrapper kernel (ggml-org#7172)" (ggml-org#7980)

* Revert "Minor arithmetic improvement to mmvq wrapper kernel (ggml-org#7172)"

This reverts commit 8c570c9.

* Update ggml-sycl.cpp