Support for architecture DeepseekV2ForCausalLM #512
Hi!
When trying to quantize the new DeepSeek Coder V2 https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct I got the following error:
Would it be possible to add support?
Comments
This is the same issue as #443. The architecture uses shared experts, which would have to be added to the implementation. Since I don't have the hardware to actually run the model (quantized or otherwise), I'd be developing remotely, which is slow and awkward compared to local development where I have a debugger, profiler and all these other tools available. Not to mention, it would be expensive for the kind of server that I would need. Something like $100/day, with some big changes (i.e. quite a few days) needed due to the difficulty of calibrating a model with 162 experts per layer. In the end I don't think there are that many people who could even run the model anyway, or afford to host it, so it seems like a waste of my time. Especially as it's the kind of model that really screams for CPU inference (using llama.cpp or whatever). You could build a fairly cheap CPU server with 256 GB of RAM and probably get quite reasonable speeds that way, since it's a sparse model.
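For context, here is a minimal sketch (purely illustrative; not exllamav2's actual code, and the class and parameter names are made up) of what makes the DeepseekV2 MoE layers different: on top of the routed experts there are shared experts that run for every token, so each layer has an extra always-on FFN path that the loader and quantizer would also need to handle.

```python
import torch
import torch.nn as nn

class DeepseekV2StyleMoE(nn.Module):
    """Illustrative sketch only: routed experts plus always-active shared experts."""
    def __init__(self, hidden, ffn_dim, n_routed=160, n_shared=2, top_k=6):
        # n_routed + n_shared matches the "162 experts per layer" figure above;
        # hidden/ffn_dim/top_k are model-specific and just placeholders here.
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, n_routed, bias=False)        # router over routed experts only
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden))
            for _ in range(n_routed)
        )
        # Shared experts are not routed: every token goes through them unconditionally
        self.shared = nn.Sequential(nn.Linear(hidden, ffn_dim * n_shared), nn.SiLU(),
                                    nn.Linear(ffn_dim * n_shared, hidden))

    def forward(self, x):                                          # x: [tokens, hidden]
        scores = self.gate(x).softmax(dim=-1)                      # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)             # top-k routed experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):                                # naive loop, fine for a sketch
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out + self.shared(x)                                # shared-expert path added on top
```

With that many routed experts per layer, calibration has to see enough tokens routed through every expert, which is part of what makes quantizing this architecture awkward.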
Thanks a lot for taking the time to explain. How about the "Lite" version? https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct It's a 15B-parameter model with the same architecture.
Oh, I didn't know there was a smaller version. That does look more realistic. It would still need a lot of new code, so I'm not sure when exactly I can get to it. But definitely doable.
There are two small ones; one is the aforementioned Instruct, the other is the Base - https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Base
@turboderp - Regarding hardware, do you know of Matt Berman? He vlogs on AI - https://www.youtube.com/@matthew_berman - I emailed with him, and it sounds like he may know folks who can help you out, like cloud providers and some hardware companies, and he would love to help. His contact info is on the YouTube page, feel free to reach out.
Welp, guess I have to stick with some slow-mo GGUF.
FYI - The Lite Instruct model is amazingly good, easily the best coding model I've used, better than Codestral and much faster (even on GGUF), but it would really benefit from exllamav2's long-context and KV efficiency. I have 1x 3090 (24GB) and 2x A4000 (2x16GB); if you need me to test anything / run some builds, feel free to @ me or contact via profile.
I have a 4x 3090, 48-core EPYC, 512 GB RAM system and could provide access too if needed.
Ultimately it's not hardware I need, it's time. I have no doubt that it's the best-model-ever, but so were Yi, Orion, Gemma, Starcoder, GemMoE, Cohere, DBRX, Phi... Granite? All the time I spent on those architectures may or may not have been worth it, but it definitely took time away from other improvements I'd like to make, and there are core aspects of the library that really need attention as well. And with every new architecture I implement just in time for everyone to become disenchanted with it, I also add technical debt. So yeah, I'm hesitant. 🤷 Maybe. Just not right now, unless someone wants to contribute some code.
Totally understandable. Thank you for the response. There will be other great models in the future :) |
@turboderp Sorry to bring this up again, but I was wondering if it's even remotely possible to merge KTransformers with exllamav2. That way you could run larger models but benefit from exllama's optimizations. KTransformers lets you store some of the experts in DRAM. I'm not sure if that is compatible with exllama. Not sure if it's possible or even provides any value, just a thought. Also, I'm not suggesting you do it alone; if you can at least consider whether it's conceptually feasible and worth doing, I'd be willing to put in the time to attempt writing the code.
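Purely as a conceptual sketch of the offloading idea (this is not KTransformers' or exllamav2's actual API; every name below is made up), the expert weights could sit in pinned CPU memory, with only the experts the router actually selects copied to the GPU for the current batch:

```python
import torch

class OffloadedExpertBank:
    """Illustrative only: park expert weights in pinned host RAM and
    fetch just the routed experts to the GPU on demand."""
    def __init__(self, expert_state_dicts, device="cuda"):
        self.device = device
        # Pinned memory allows faster, asynchronous host-to-device copies
        self.cpu_experts = [
            {name: t.pin_memory() for name, t in sd.items()} for sd in expert_state_dicts
        ]

    def fetch(self, expert_ids):
        """Return GPU copies of only the experts selected for this batch."""
        gpu = {}
        for e in expert_ids:
            gpu[e] = {name: t.to(self.device, non_blocking=True)
                      for name, t in self.cpu_experts[e].items()}
        torch.cuda.synchronize()   # make sure the async copies have landed before use
        return gpu
```

Whether the transfer cost could be hidden behind exllamav2's kernels, or whether the offloaded experts would instead need to run on the CPU (which is, as I understand it, what KTransformers does), is exactly the feasibility question here.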
Anyone know if Deepseek R1 also uses shared experts and has the same issue for Exllama support as V2? |