llama : add DeepSeek-v2-Chat support #7118
Comments
That would be awesome. |
Impressive model, and potentially a CPU-friendly model (if you have >96 GB of memory) |
@ggerganov I'd be very interested in helping, I want to get into porting models to inference engines. Would you be so kind as to provide a rough outline of what needs to be done here? I'd then submit a draft PR and ask about the little details that don't work |
Interesting - can we get a rundown of the multi-head latent KV cache technique? @SinanAkkoyun Look at PRs that have already been merged and that add support for new model arches |
Sure thing. Here's their tech report: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf |
Thanks, very cool work! Adding this to the roadmap to give it more visibility |
+1 |
I'm working on it right now: https://youtu.be/1AG-GUtDvaw |
@fairydreaming Oh wow how awesome!! How does the ppl look? |
@SinanAkkoyun At this moment it's somewhat high (Q8_0):
|
This is normal for non-base models |
Would love to see support for the smaller MoE models. They seem to be good and only use 2.5b active parameters for token generation. |
You can try my branch if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2
|
There is this PR from a while ago: #4093 Though DS2 seems not to use the "GPT-NeoX RoPE", as we call it, so it's probably not relevant
How many parameters are there? I don't think we have a better solution than adding them to the GGUF header |
@ggerganov here they are:
What do you think? |
I think it's fine to add those parameters |
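For context, adding new hyperparameters like these typically means writing extra key-value pairs into the GGUF metadata during conversion. A minimal sketch using the gguf Python package, based on its standard writer example - the key names and values below are purely illustrative, not the actual DeepSeek-V2 parameters:

```python
import numpy as np
from gguf import GGUFWriter

# Create a writer for a toy model file; "llama" is just an example architecture name.
gguf_writer = GGUFWriter("example.gguf", "llama")

# Standard hyperparameters have dedicated helpers...
gguf_writer.add_block_count(12)

# ...while model-specific values can be stored as custom key-value pairs.
# These key names are made up for illustration only.
gguf_writer.add_uint32("example.rope.dim_count", 64)
gguf_writer.add_float32("example.expert_weights_scale", 1.0)

# A dummy tensor so the file is a complete, loadable GGUF.
gguf_writer.add_tensor("dummy.weight", np.ones((4, 4), dtype=np.float32))

gguf_writer.write_header_to_file()
gguf_writer.write_kv_data_to_file()
gguf_writer.write_tensors_to_file()
gguf_writer.close()
```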
The difference in YaRN RoPE that I noticed is that llama.cpp scales sin and cos values with mscale calculated like this:
while the DeepSeek-V2 transformers implementation uses the following code:
where yarn_get_mscale is:
It uses the same calculation as llama.cpp, but twice - first for self.mscale (which is 0.707 in the config.json), then for self.mscale_all_dim (which is also 0.707 in the config.json) - and then divides the first calculated value by the second. However, the result will be 1.0 since both mscales are the same. The DeepSeek-V2 vLLM implementation also does this. There's even a comment:
In the DeepSeek-V2 paper there is: "Slightly diverging from original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy", but I'm not sure if they are talking about the difference I noticed. |
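For illustration, here is a minimal sketch of the two mscale computations being compared, assuming the standard YaRN magnitude-scaling formula 0.1 * mscale * ln(scale) + 1 (the scaling factor value is only an example):

```python
import math

def yarn_get_mscale(scale: float = 1.0, mscale: float = 1.0) -> float:
    # YaRN magnitude scaling: 0.1 * mscale * ln(scale) + 1
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

scaling_factor = 40.0   # example RoPE scaling factor
mscale = 0.707          # "mscale" from config.json
mscale_all_dim = 0.707  # "mscale_all_dim" from config.json

# llama.cpp-style: a single magnitude correction applied to the sin/cos values
llama_cpp_scale = yarn_get_mscale(scaling_factor, mscale)

# DeepSeek-V2-style: ratio of two corrections; this is exactly 1.0 whenever
# mscale == mscale_all_dim, as in the released config
deepseek_scale = yarn_get_mscale(scaling_factor, mscale) / yarn_get_mscale(scaling_factor, mscale_all_dim)

print(llama_cpp_scale, deepseek_scale)  # ~1.26 vs 1.0
```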
Hm, that's strange - what's the point of multiplying by |
@CyberTimon I added support for the lite model in my branch, you can try it out now if you want: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v2 |
@ggerganov I think YaRN also affects the calculation of the sin/cos frequencies (theta variable), so we can't simply disable it. Anyway, I found another quirk of DeepSeek-V2 - it uses a scalar value to scale the expert weights instead of normalizing them. After taking it into account, perplexity looks much better in the chat model (Q8_0):
Of course it will require another parameter to be added to the model headers. |
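To make the difference concrete, here is a small sketch of the two ways of combining expert outputs - conventional renormalization versus scaling by a fixed scalar as described above. The function names, tensor shapes, and the scaling value are illustrative only:

```python
import torch

def combine_experts_normalized(expert_out, router_probs):
    # Common MoE convention: renormalize the top-k router probabilities
    # so they sum to 1 before mixing the expert outputs.
    weights = router_probs / router_probs.sum(dim=-1, keepdim=True)
    return (expert_out * weights.unsqueeze(-1)).sum(dim=-2)

def combine_experts_scaled(expert_out, router_probs, routed_scaling_factor=16.0):
    # DeepSeek-V2-style, as described above: keep the raw top-k probabilities
    # and multiply by a fixed scalar instead of renormalizing.
    # The name and value of the scalar are placeholders here.
    weights = router_probs * routed_scaling_factor
    return (expert_out * weights.unsqueeze(-1)).sum(dim=-2)

# expert_out: (n_tokens, top_k, hidden), router_probs: (n_tokens, top_k)
expert_out = torch.randn(4, 6, 32)
router_probs = torch.rand(4, 6)
y_normalized = combine_experts_normalized(expert_out, router_probs)
y_scaled = combine_experts_scaled(expert_out, router_probs)
```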
https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite :P A model for everyone to test |
The MLA approach can probably be combined with the Pyramid KV cache - https://arxiv.org/abs/2405.12532 |
Is the main branch code now able to support DeepSeek-V2 inference? |
@DirtyKnightForVi I have limited knowledge of Windows, but I guess there is some disk swap mechanism in use. |
@fairydreaming I'm running it on Ubuntu, and CPU offload may be the reason why it works well. @ggerganov This might be a default setting, but are there other configurations that can fully load my CPU or GPU? I'm quite curious about the origin of this setting. |
@DirtyKnightForVi It's because you run it with the context size (n_ctx) set to 512, while on my machine it was set to the default training context size of 163840. |
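To give a feel for why that matters, here is a rough back-of-the-envelope sketch of how an f16 KV cache grows with context length. The layer and head dimensions below are illustrative, not the exact shapes llama.cpp allocates for this model:

```python
def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim_k, head_dim_v, bytes_per_elem=2):
    # One K row and one V row are cached per token, per layer, per KV head.
    per_token = n_layer * n_head_kv * (head_dim_k + head_dim_v) * bytes_per_elem
    return n_ctx * per_token

# Illustrative dimensions only.
n_layer, n_head_kv, head_dim_k, head_dim_v = 60, 128, 192, 128

for n_ctx in (512, 163840):
    gib = kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim_k, head_dim_v) / 2**30
    print(f"n_ctx={n_ctx}: ~{gib:.0f} GiB")

# n_ctx=512 stays in the single-digit GiB range, while the 163840-token
# default lands in the hundreds of GiB - hence the swapping/OOM behavior.
```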
Linux is ok. |
On Linux on my workstation it crashes too, seems to be related to changes in chat template support in #8068. |
Can someone please explain why this implementation runs significantly slower compared to a dense model with the same active parameter count? |
@llmlover Could you provide some data? I don't think there are significant differences. Here is my result (I am using chatllm.cpp. Performance on CPU should be similar to llama.cpp) using the classic prompt "write a quick sort function in python".

CodeGemma 2.5B (Q8_0):
timings: prompt eval time = 352.95 ms / 16 tokens ( 22.06 ms per token, 45.33 tokens per second)
timings: eval time = 8555.32 ms / 139 tokens ( 61.55 ms per token, 16.25 tokens per second)
timings: total time = 8908.27 ms / 155 tokens

DeepSeek-v2-Chat, 2.7B active (Q8_0):
timings: prompt eval time = 1182.08 ms / 15 tokens ( 78.81 ms per token, 12.69 tokens per second)
timings: eval time = 16229.16 ms / 224 tokens ( 72.45 ms per token, 13.80 tokens per second)
timings: total time = 17411.25 ms / 239 tokens

Yes, it is slower, but not that significant. |
@foldl Thank you for running the tests |
You are talking about "prompt eval time"? It's slower in DeepSeekCoder because the model file is significant larger than CodeGemma 2.5B. The reported time is affected by model loading. If you measure a new round, "prompt eval time" becomes much shorter. All in all I don't think it is significantly slower than a same-sized dense model. |
Anyone figured out the |
@LostRuins How did you get that error (in details if possible)? |
Model used is https://huggingface.co/mradermacher/DeepSeek-Coder-V2-Lite-Instruct-GGUF/blob/main/DeepSeek-Coder-V2-Lite-Instruct.Q3_K_S.gguf I just cloned the repo, used w64devkit to make and ran with llama-cli. Here are my full logs of these 3 steps: |
Worth noting that the CI builds which were made with MSVC do not seem to have this issue. Also, this is sort of off-topic, but looking at my logs again there seems to be... a type IQ4_NL used inside a Q3_K_S? Is that intentional or a bug? @ikawrakow (Apologies if I missed something, but it just stood out as weird. It does break things for me separately, since some backends like Vulkan don't support IQ quants, and the larger K quants work fine.) |
I confirm the problem, but it's not llama.cpp's fault. For some reason the C++ standard library used in mingw (it's a part of w64devkit) is unable to convert certain Unicode characters. For example, this doesn't work:
So you'd have to report this bug to mingw, not llama.cpp; maybe they will have some idea about how to fix it. |
@LostRuins I tried replacing std::wstring_convert with a custom function to avoid this problem, but another problem appeared later, this time with std::wregex. |
That's unfortunate. I tried googling, but all I could find were some mentions of setting |
@foldl No, they are most likely referring to deepseek coder v2: gemma:2b: Immense performance difference. Why is that the case? I am curious as well. |
(ignore eval count tokens, it makes no difference at that scale, I tested it) |
@foldl @SinanAkkoyun The performance difference is most likely caused by the fact that there are many more operations in the attention implementation of DeepSeek-V2 compared to Gemma. When Gemma calculates the query, key and value vectors it does the following tensor operations (there are 35 tokens in the example):
while DeepSeek-V2 does the following in the same part of the model layer (there are 17 tokens in the example):
DeepSeek-V2 uses MLA (Multi-head Latent Attention), which means that there are some additional tensor operations in this part of the model. Other than that, there are also CONT operations making tensors contiguous in memory (as implementations of most CUDA tensor operations require that input tensors are contiguous in memory); these operations also carry a performance penalty. |
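As a rough illustration of why MLA needs more matmuls per layer than standard attention, here is a simplified PyTorch sketch. The dimensions are made up, and layer norms, RoPE application, and the attention computation itself are omitted:

```python
import torch
import torch.nn as nn

hidden, n_head, head_dim = 2048, 16, 128
q_lora_rank, kv_lora_rank = 1536, 512
qk_nope_dim, qk_rope_dim, v_dim = 128, 64, 128

# Standard attention (Gemma-style): one projection each for Q, K, V.
wq = nn.Linear(hidden, n_head * head_dim, bias=False)
wk = nn.Linear(hidden, n_head * head_dim, bias=False)
wv = nn.Linear(hidden, n_head * head_dim, bias=False)

# MLA (DeepSeek-V2-style, simplified): queries and keys/values go through
# low-rank "latent" down/up projections plus a separate rotary branch.
q_down  = nn.Linear(hidden, q_lora_rank, bias=False)
q_up    = nn.Linear(q_lora_rank, n_head * (qk_nope_dim + qk_rope_dim), bias=False)
kv_down = nn.Linear(hidden, kv_lora_rank + qk_rope_dim, bias=False)  # latent + shared RoPE key
kv_up   = nn.Linear(kv_lora_rank, n_head * (qk_nope_dim + v_dim), bias=False)

x = torch.randn(1, 17, hidden)  # 17 tokens, as in the example above

# Gemma-style: Q, K, V each come from a single matmul.
q, k, v = wq(x), wk(x), wv(x)

# MLA: several intermediate tensors are produced before Q, K, V even exist,
# and the reshapes in between are what trigger the extra CONT operations.
cq = q_down(x)
q_full = q_up(cq)
ckv = kv_down(x)
c_latent, k_rope = ckv.split([kv_lora_rank, qk_rope_dim], dim=-1)
kv_full = kv_up(c_latent)
```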
@fairydreaming Thank you a lot for your clear explanation! I was also wondering how OAI manages (despite speculative decoding) to make GPT4 run so fast with presumably 200B active parameters. Also, is it possible to optimize DeepSeekCoderV2 even further, or is 80 tps the practical limit of this architecture today? |
Yes, this is not an optimal implementation, it simply required the least amount of work. It definitely can be optimized further. |
@fairydreaming Thank you for the insight. Based on your intuition, what could the performance gain look like in TPS and how much work would it require? I might know some people who would be interested in taking this on if the potential improvement justifies the effort |
@SinanAkkoyun I think you should profile the CUDA implementation first to identify likely bottlenecks. A wise man once said: premature optimization is the root of all evil. My intuition mumbles something about a few percent of improvement, but I wouldn't rely on it. |
@fairydreaming Thank you very very much, I appreciate your comments! |
@fairydreaming vLLM just dropped DS Coder V2 support. Could it be that some bigger gains could still be made for llama.cpp? I am not knowledgeable enough to assess the implementation, but perhaps something went unnoticed; I cannot imagine that vLLM's paged attention makes much difference for 40 generated tokens |
@SinanAkkoyun I don't know, maybe. |
I just ran my farel-bench benchmark on the updated DeepSeek-V2 and the score is amazing! It's better than any other open-weights model! This is also confirmed by the ZebraLogic benchmark. So if you still use the older model, I think it's wise to update. By the way, I got almost the same scores by running the Q8_0 quant locally (score of 87.78) in llama.cpp and by using the OpenRouter API (score of 87.56), so the implementation of this model seems to be in good shape in llama.cpp. |
@Ishankhan21 you are getting out-of-memory errors because you run the model without setting the context size (which results in the default value of 163840 being used); try adding, for example, -c 4096 to your llama.cpp command-line options |
Please support deepseek-ai/DeepSeek-V2-Chat
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat