Eval bug: DeepSeek-R1-UD-Q2_K_XL output broken #13305
Might be related to the issues many of us doing partial offloading have been having since the MLA commit. For me the gibberish began with the MLA commit, but I was able to partially work around it by enabling a Q8_0 K cache. I see you have a Q4_0 cache enabled, so perhaps you are only just hitting it now, for whatever reason, with the newer changes.
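For anyone who wants to try the same workaround, a rough sketch of such an invocation is below; the model path and layer count are placeholders (not taken from this issue), and the quantized-K-cache flag is the standard llama.cpp option:

```bash
# Sketch: partial offload with a Q8_0 K cache as a workaround
# (model path and -ngl value are placeholders)
./build/bin/llama-server \
  -m /models/DeepSeek-R1-UD-Q2_K_XL.gguf \
  -ngl 5 \
  --cache-type-k q8_0
```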
Calculating perplexity on Wikitext using DeepSeek V2 Lite q4_0:
- With max. GPU layers:
- With
- With
- With
- With
- Without a GPU at all:

So it does look like there are issues with partial offloading.
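For reference, runs like the ones above can be reproduced with the llama-perplexity tool; a rough sketch, with model/dataset paths and layer counts as placeholders:

```bash
# Perplexity on Wikitext with different amounts of GPU offload (placeholder paths)
./build/bin/llama-perplexity -m deepseek-v2-lite-q4_0.gguf -f wiki.test.raw -ngl 99   # full offload
./build/bin/llama-perplexity -m deepseek-v2-lite-q4_0.gguf -f wiki.test.raw -ngl 10   # partial offload
./build/bin/llama-perplexity -m deepseek-v2-lite-q4_0.gguf -f wiki.test.raw -ngl 0    # CPU only
```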
Oh yes, I was supposed to bring the discussion over from https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2 - thanks @joesixpaq for the extensive tests. I think the MLA commit might still have some issues - the new MMQ commit @JohannesGaessler made should, I think, be fixable with some small changes, but I'm unsure whether there are interactions going on. The only thing I can confirm for now is that full GPU offloading (i.e. all layers) seems to work OK - most people get gibberish outputs when CPU offloading is used.
The problem disappeared when I specified --no-kv-offload on the first broken commit, e1e8e09. In another try I additionally offloaded 5 layers to the GPU (the maximum for an RTX 3090), and that worked as well.
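A sketch of the two configurations described above (model path and exact values are placeholders; --no-kv-offload keeps the KV cache in host memory):

```bash
# KV cache kept on the CPU, no layers offloaded
./build/bin/llama-server -m model.gguf -ngl 0 --no-kv-offload
# KV cache on the CPU, 5 layers offloaded (the most that fit on one RTX 3090 here)
./build/bin/llama-server -m model.gguf -ngl 5 --no-kv-offload
```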
It works with the latest commit 66645a5 and --no-kv-offload.

```
/home/ai/workspace/LLAMA_CPP/66645a5285d8c4c5f9a3b3f360d042baac2d820a/llama.cpp/build/bin/llama-server
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
```
@danielhanchen since you're already here, can I recruit you to help with investigating #13287 (comment)?
I could try, but I doubt I'd be helpful! I've only recently started working on the outer layers of llama.cpp, not any of the internals.
It seems to work with a BF16 model. Maybe some padding is not cleared correctly when copying the quantized tensors to VRAM?
Yes, I already figured out that the issue specifically occurs for the combination of MMQ and
What I meant is: since you are already running extensive tests with perplexity and KL divergence, could you check whether my changes have made the model predictions worse in a statistically significant way?
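For reference, this kind of comparison can be done with llama-perplexity's KL-divergence mode; a sketch assuming a higher-precision reference model and placeholder paths:

```bash
# 1) Record reference logits from a baseline model/build
./build/bin/llama-perplexity -m model-bf16.gguf -f wiki.test.raw --kl-divergence-base logits.kld
# 2) Compare the model/build under test against the recorded logits
./build/bin/llama-perplexity -m model-q4_0.gguf -f wiki.test.raw --kl-divergence-base logits.kld --kl-divergence
```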
Should be fixed by #13320.
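For anyone wanting to verify the fix before it lands, a rough sketch of checking out the PR head and rebuilding with CUDA (standard GitHub PR refs and the usual llama.cpp CUDA build flags):

```bash
git fetch origin pull/13320/head:pr-13320
git checkout pr-13320
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```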
This fixed the issues I’ve been having. Thank you very much.
@JohannesGaessler I was just checking all your new changes - great work as usual! imatrix.cpp sadly still gets errors with:

```
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_mul_mat_id at llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2062
  cudaGetLastError()
```

i.e. at:

```cpp
get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted,
    ne10, nb11, nb12, nb13,
    ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t),
    ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream);
CUDA_CHECK(cudaGetLastError());
```

My suspicion is that CUDA requires these arguments to stay within a certain limit, i.e. one of ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted is exceeding it. I'll make a new issue!
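For completeness, a sketch of the kind of imatrix run that hits the error above; the model and calibration text are placeholders:

```bash
# imatrix computation with partial GPU offload (placeholder paths/values)
./build/bin/llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat -ngl 5
```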
Name and Version
I experience gibberish output with DeepSeek-R1-UD-Q2_K_XL by unsloth (files verified with SHA256).
In my case, the gibberish output started with e1e8e09.
I eventually managed to isolate the latest still-working commit: 6f67cf1
The most recent commit tested that is still broken: 9f2da58
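Such a first bad commit can be pinned down with a standard git bisect between the known-good and known-bad commits, for example:

```bash
git bisect start
git bisect bad 9f2da58    # most recent tested commit that is still broken
git bisect good 6f67cf1   # latest commit that still works
# rebuild and test at each step, then mark it with `git bisect good` or `git bisect bad`
```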
Operating systems
Linux
GGML backends
CUDA
Hardware
1x RTX 3090, Intel Xeon E5-2640 v3, 1TB RAM
Models
DeepSeek-R1-UD-Q2_K_XL by unsloth
Problem description & steps to reproduce
Nonsensical output, partly consisting of Chinese characters.
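A sketch of the kind of setup that reproduces this (partial offload on a single RTX 3090); the model path, layer count and cache type are placeholders inferred from the discussion above:

```bash
# Partial offload of DeepSeek-R1-UD-Q2_K_XL on one RTX 3090 (placeholder path/values)
./build/bin/llama-server -m DeepSeek-R1-UD-Q2_K_XL.gguf -ngl 5 --cache-type-k q4_0
# responses then come back as gibberish with mixed Chinese characters on the affected commits
```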
First Bad Commit
e1e8e09
Relevant log output