multi-gpu inference produces broken output #3772
Comments
Did it work before? |
Yes! I just tested different commits to narrow down the issue: Multi-gpu inference has worked fine even on 8 GPUs until (including) 8b428c9. It seems that from 111163e something has broken (@JohannesGaessler). |
I cannot reproduce the issue using 3x P40. Are you running llama.cpp inside a virtual machine or WSL? |
Nope, the linux system and llama.cpp are directly accessing the hardware, i.e. no virtualisation is involved. |
The thing is that in ggml_op_mul_mat we are using cudaMemcpy2DAsync to copy data from every active GPU other than the main one to the main GPU, but this is possible only when the GPUs support cross-device DMA (peer-to-peer access), which is true only for large-BAR PCI systems. The solution is to use hip/cudaMemcpyDtoDAsync in a loop to fill slices in dst from data in src0; this avoids crashes on devices without P2P access. But there is a second kind of bug present. |
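To make the "copy in a loop" idea above concrete, here is a minimal host-only sketch of splitting one pitched 2D copy into one plain copy per row. It is an illustration under assumptions, not the code in ggml-cuda.cu: `copy_2d_rowwise` is a hypothetical helper, and `std::memcpy` stands in for the per-slice device-to-device copy that a real fix would issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of llama.cpp): copies `height` rows of
// `width` bytes each from a pitched source into a pitched destination,
// issuing one copy per row instead of a single 2D copy.
// std::memcpy is a stand-in for the per-slice device-to-device copy.
void copy_2d_rowwise(void* dst, std::size_t dst_pitch,
                     const void* src, std::size_t src_pitch,
                     std::size_t width, std::size_t height) {
    // A row must fit inside both pitches, otherwise rows would overlap.
    assert(width <= dst_pitch && width <= src_pitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // Each iteration fills one contiguous slice of the destination.
        std::memcpy(d + row * dst_pitch, s + row * src_pitch, width);
    }
}
```

The trade-off is one call per row rather than one call in total, so a loop like this would presumably be slower than a single 2D copy where peer access does work.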
Maybe somebody could explain what we are expecting to achieve in dst, and in what form? I just can't understand the math, especially regarding the src0 transposition: the call to ggml_is_transposed(src0) returns False! @ggerganov @JohannesGaessler @slaren @FSSRepo Upd: On solo gpu: On multi-gpu: Maybe this could lead to a SIGSEGV later on in memcpy2d. |
I’m also having this issue with 2x 4090s - it actually corrupts the model files when I use 2 GPUs. Both work fine by themselves using Tried CUDA 12.3, 12.1, rocky linux, and ubuntu. |
The STRANGEST part is that it works beautifully on my old box, a dual Xeon DELL 7610 with two 1080TI and one M6000, but it produces only garbage on my newly built box, an ASUS X99-e WS build with two 3090 24GB founders edition. I copied the source and recompiled with the same make LLAMA_CUBLAS=1. On the DELL with the older 1080TI and the even older M6000 24GB, the 13B Llama 2 produces nice output at pretty decent speed, but on the ASUS with two 3090 it produces garbage. It works if I take one of the 3090 out, but what is the point? I want to use the Q4 70B model. |
@dji-transpire Can you check the versions of CUDA, CUDNN, CUBLAS, the NVIDIA driver, and any relevant SDKs -- were they the same? The GPU models (generations) already differ, so that might also be a factor even if everything else is the same. |
Thanks!!! You nailed it! The old box is running the 535 driver, the new box runs the latest 545 driver. Downgrading nvidia-dkms, nvidia-utils and lib32-nvidia-utils to 535 and putting these on the IgnorePkg list solved the issue. Now both 3090 founders edition cards play nicely with LLAMA 13B and Q4 70B. So: Be careful with the 545 version of the Nvidia driver and multiple GPUs???? |
@dji-transpire Also running into the same issue with 3x 1080Ti, running driver version 545.29.08. Which exact version of 535 did you revert to? Was it to the latest 535.129.03? |
Note: A workaround for this bug is to use the CMake flag
This will disable CUDA peer access completely and produce correct output when multiple GPUs are used. |
is there a way to make this work in textgeneration-webui without downgrading nvidia drivers? |
Any news on this? For dual 7900 XTX I'm still getting garbage with hipBLAS build, regardless of model. But on single card it works. I tried the |
@slaren yeah I know that, and I have no hope of it being fixed on the AMD side soon, so I have very little hope of using pytorch with dual cards. Yet llama.cpp is much much better imo :) and flexible. I already have them working via vulkan, just mixtral on vk is still missing, but I know 0cc4m is working on it. But even without it I think llama.cpp already does some "manual workarounds" for what the underlying libs do not provide, so if the problem that I have (one 8x is on CPU, the other via chipset) can be worked around via some slower "manual" data copying, that would still be nice :) . In any case for now vulkan seems like my best bet, so I'll be waiting for updates from 0cc4m :) |
Somebody with access to dual 7900 XTX would need to diagnose the issue. AFAIK nobody who is working on the CUDA/HIP backend at the moment has access to this hardware. |
Yeah, understandable :) for now I'm mostly happy with vulkan, and when mixtral is supported, I think I'll have basically no need for the HIP build. Still, if this somehow progresses, it will also be nice to know. Thanks! |
Can you test if it works with this change? (do not use
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 04c6f5d0..06af740e 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -797,7 +797,7 @@ static ggml_backend_buffer_i ggml_backend_cuda_buffer_interface = {
/* .init_tensor = */ ggml_backend_cuda_buffer_init_tensor,
/* .set_tensor = */ ggml_backend_cuda_buffer_set_tensor,
/* .get_tensor = */ ggml_backend_cuda_buffer_get_tensor,
- /* .cpy_tensor = */ ggml_backend_cuda_buffer_cpy_tensor,
+ /* .cpy_tensor = */ NULL,//ggml_backend_cuda_buffer_cpy_tensor,
/* .clear = */ ggml_backend_cuda_buffer_clear,
/* .reset = */ NULL,
};
@@ -11584,7 +11584,7 @@ static ggml_backend_i ggml_backend_cuda_interface = {
/* .get_default_buffer_type = */ ggml_backend_cuda_get_default_buffer_type,
/* .set_tensor_async = */ ggml_backend_cuda_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_cuda_get_tensor_async,
- /* .cpy_tensor_async = */ ggml_backend_cuda_cpy_tensor_async,
+ /* .cpy_tensor_async = */ NULL,//ggml_backend_cuda_cpy_tensor_async,
/* .synchronize = */ ggml_backend_cuda_synchronize,
/* .graph_plan_create = */ NULL,
/* .graph_plan_free = */ NULL,
@@ -11592,10 +11592,10 @@ static ggml_backend_i ggml_backend_cuda_interface = {
/* .graph_compute = */ ggml_backend_cuda_graph_compute,
/* .supports_op = */ ggml_backend_cuda_supports_op,
/* .offload_op = */ ggml_backend_cuda_offload_op,
- /* .event_new = */ ggml_backend_cuda_event_new,
- /* .event_free = */ ggml_backend_cuda_event_free,
- /* .event_record = */ ggml_backend_cuda_event_record,
- /* .event_wait = */ ggml_backend_cuda_event_wait,
+ /* .event_new = */ NULL,//ggml_backend_cuda_event_new,
+ /* .event_free = */ NULL,//ggml_backend_cuda_event_free,
+ /* .event_record = */ NULL,//ggml_backend_cuda_event_record,
+ /* .event_wait = */ NULL,//ggml_backend_cuda_event_wait,
/* .event_synchronize = */ ggml_backend_cuda_event_synchronize,
}; |
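The patch above replaces several entries in the interface tables with NULL. One plausible reading, which the thread does not confirm, is that the caller treats a NULL entry as "no fast path available" and falls back to a generic, slower copy staged through host memory. A minimal host-only sketch of that pattern follows; `buffer_iface`, `copy_via_staging` and `tensor_copy` are hypothetical names for illustration, not the ggml API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a backend buffer interface.
struct buffer_iface {
    // Optional fast path; returns true if it handled the copy. May be null.
    bool (*cpy_tensor)(const float* src, float* dst, std::size_t n);
};

// Generic path: stage the data through a temporary host-side buffer.
inline void copy_via_staging(const float* src, float* dst, std::size_t n) {
    std::vector<float> staging(src, src + n);
    std::copy(staging.begin(), staging.end(), dst);
}

// Returns true if the fast path was used, false if the fallback ran.
inline bool tensor_copy(const buffer_iface& iface,
                        const float* src, float* dst, std::size_t n) {
    if (iface.cpy_tensor != nullptr && iface.cpy_tensor(src, dst, n)) {
        return true;
    }
    copy_via_staging(src, dst, n);  // entry is null or declined the copy
    return false;
}
```

Under this assumption, setting an entry to NULL trades speed for a path that does not rely on direct device-to-device transfers.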
@slaren oh wow! Rebuilt on fresh checkout with your patch, and so far I think it works, just tested with single chat with one character in SillyTavern and it seems to be generating sensible stuff (as much as one can expect from model at this time :) ). Tested on couple models, command-r Q6 and noromaid mixtral Q4_K_M. I'll try some more stuff later today, but I think you have here a winning patch! 👍 |
Ok testing some more generations, using mixtral, all seems to be working fine! Huge thanks @slaren ! |
Expected Behavior
I am running several large language models on my small GPU cluster using the latest version of llama.cpp. The GPU cluster has multiple NVIDIA RTX 3070 GPUs. Inference on a single GPU, enforced by CUDA_VISIBLE_DEVICES=0, of different flavors of LLMs (llama, mistral, mistral german) works as expected, i.e. the model answers my prompt in the appropriate language (German/English).
Current Behavior
However, the model is simply returning characters and sharps (#) once I run inference on multiple GPUs:
Environment and Context
Failure Information (for bugs)
The issue seems to be unrelated to the actual model as well as its size. I'm observing this issue with llama models ranging from 7B to 70B parameters.
It hardly depends on the choice of -ngl, as the model produces broken output for any value larger than 0. Context size -c, generated tokens -n, --no-mmap, and -nommq don't resolve the issue either.
Steps to Reproduce
Get model in GGUF format (e.g. huggingface.co/TheBloke/Llama-2-7B-GGUF)
Query model
Failure Logs
Verbose console output for inference of llama-2 7B: output.log
Make log: make.log