multi-gpu inference produces broken output #3772

nih23 · 2023-10-25T06:35:00Z

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am running several large language models on my small GPU cluster using the latest version of llama.cpp. The GPU cluster has multiple NVIDIA RTX 3070 GPUs. Inference on a single GPU, enforced by CUDA_VISIBLE_DEVICES=0, works as expected for different flavors of LLMs (llama, mistral, mistral german), i.e. the model answers my prompt in the appropriate language (German/English).

CUDA_VISIBLE_DEVICES=0 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100
[...]

Why is the sky blue? Answer for a 5 year old child.
The sky is blue because of the scattering of light by molecules in the atmosphere. The sunlight that reaches us from space has all colors mixed together, but when it passes through our atmosphere, some of its color is scattered away. Blue light scatters more than other colors, so we see a blue sky.

Current Behavior

However, once I run inference on multiple GPUs, the model returns only stray characters and hash signs (#):

CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100


Why is the sky blue? Answer for a 5 year old child. dispos###################################################################################################

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3


Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
CPU family:                         6
Model:                              158
Thread(s) per core:                 1
Core(s) per socket:                 4
Socket(s):                          1
Stepping:                           9
CPU(s) scaling MHz:                 19%
CPU max MHz:                        4200.0000
CPU min MHz:                        800.0000
BogoMIPS:                           7599.80
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
Virtualization:                     VT-x
L1d cache:                          128 KiB (4 instances)
L1i cache:                          128 KiB (4 instances)
L2 cache:                           1 MiB (4 instances)
L3 cache:                           6 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Vulnerable: No microcode
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled

Operating System, e.g. for Linux:

$ uname -a

Linux ml 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux

SDK version, e.g. for Linux:

$ python3 --version
Python 3.9.18

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

$ nvidia-smi
Wed Oct 25 05:57:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070        On  | 00000000:02:00.0 Off |                  N/A |
|  0%   43C    P8              22W / 220W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3070        On  | 00000000:03:00.0 Off |                  N/A |
|  0%   45C    P8              15W / 220W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3070        On  | 00000000:04:00.0 Off |                  N/A |
|  0%   40C    P8              20W / 220W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
[...]

Failure Information (for bugs)

The issue seems to be unrelated to the actual model and to its size. I'm observing it with llama models ranging from 7B to 70B parameters.
It is also largely independent of the choice of -ngl: the model produces broken output for any value larger than 0. Changing the context size (-c) or the number of generated tokens (-n), or passing --no-mmap or -nommq, does not resolve the issue either.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

Get code

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build with CUDA support

make LLAMA_CUBLAS=1

Get a model in GGUF format (e.g. huggingface.co/TheBloke/Llama-2-7B-GGUF)
Query model

CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100

Failure Logs

Verbose console output for inference of llama-2 7B: output.log

Make log: make.log


ggerganov · 2023-10-25T07:02:44Z

Did it work before?
If it did, can you bisect where it stopped working?
Can you check if going back before 2b4ea35e56792064598e922e46d081e02bc96b94 fixes it?

nih23 · 2023-10-25T08:20:17Z

Yes! I just tested different commits to narrow down the issue: multi-GPU inference worked fine, even on 8 GPUs, up to and including 8b428c9. Something seems to have broken starting from 111163e (@JohannesGaessler).
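
The narrowing-down described above could be automated with git bisect. The following is only a sketch under assumptions: `looks_broken`, `bisect_step`, the `MODEL` environment variable and the file name `bisect.sh` are hypothetical, and the "20 or more consecutive #" heuristic is derived from the broken output shown in this report, not from anything in llama.cpp itself.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the text contains a run of 20 or more '#',
# which is how the broken multi-GPU output in this report looks.
looks_broken() {
    printf '%s\n' "$1" | grep -q '#\{20,\}'
}

# Hypothetical bisect step. MODEL is an assumed environment variable
# holding the path to a .gguf file.
# Exit codes follow the git bisect run convention: 0 = good, 1 = bad, 125 = skip.
bisect_step() {
    make clean >/dev/null && make LLAMA_CUBLAS=1 >/dev/null || return 125
    out=$(CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m "$MODEL" --temp 0.01 \
          -p "Why is the sky blue?" -n 32 2>/dev/null)
    [ -n "$out" ] || return 125          # no output at all: cannot judge this commit
    if looks_broken "$out"; then return 1; else return 0; fi
}

# Usage, from inside the llama.cpp checkout, with this file saved as bisect.sh:
#   git bisect start 111163e 8b428c9     # <bad> <good>
#   git bisect run sh -c '. ./bisect.sh; bisect_step'
```

Sourcing the file only defines the two functions; nothing is built or run until `bisect_step` is called.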

JohannesGaessler · 2023-10-25T09:27:06Z

I cannot reproduce the issue using 3x P40. Are you running llama.cpp inside a virtual machine or WSL?

nih23 · 2023-10-25T09:34:20Z

Nope, the Linux system and llama.cpp access the hardware directly, i.e. no virtualisation is involved.

kotee4ko · 2023-10-25T12:03:12Z

The thing is that in ggml_op_mul_mat we use cudaMemcpy2DAsync to copy data from the other active GPUs to the main GPU, but this is only possible when the GPUs support cross-device DMA (peer-to-peer access), which is true only for large-BAR PCI systems.

The solution is to use hip/cudaMemcpyDtoDAsync in a loop to fill slices in dst from the data in src0. This avoids crashes on devices without P2P access.

But there is a second kind of bug present.
I can't say yet whether it is AMD-specific.

ggml-org/ggml#590
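
Whether peer-to-peer access is plausible on a given machine depends on how the GPUs are wired together. As a hedged sketch (the wrapper name `show_gpu_topology` is hypothetical), the connection matrix can be printed with `nvidia-smi topo -m`; entries such as PIX, PHB or SYS describe the PCIe path between each pair of GPUs. This only shows the topology and does not by itself prove that peer copies work correctly.

```shell
#!/bin/sh
# Hypothetical wrapper: print the GPU interconnect matrix if nvidia-smi is available.
show_gpu_topology() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -m prints the GPU-to-GPU connection matrix and a legend
        nvidia-smi topo -m
    else
        echo "nvidia-smi not found; cannot inspect GPU topology"
    fi
}

show_gpu_topology
```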

kotee4ko · 2023-10-25T20:37:13Z

Could anybody explain what we expect to end up with in dst, and in which form?
I think I can fix the system code and make it work correctly on both CUDA and HIP, on single or multiple devices, with or without P2P.

But I just can't understand the math, especially the src0 transposition: the call to ggml_is_transposed(src0) returns false!

@ggerganov @JohannesGaessler @slaren @FSSRepo

Upd:
When op() is called and control flow reaches ggml_cuda_op_mul_mat_cublas(), the following happens:

On a single GPU:
Convert src0 and src1 to f16, multiply using hip/cublasGemmEx, convert to f32, return.

On multi-GPU:
Almost the same, but multiply with hip/cublasSgemm and return WITHOUT conversion to f32.

Maybe this could lead to a SIGSEGV later in memcpy2d.

mgolub2 · 2023-11-17T17:13:56Z

I’m also having this issue with 2x 4090s. It actually corrupts the model files when I use 2 GPUs. Both work fine by themselves using CUDA_VISIBLE_DEVICES, and both pass gpu_burn for an hour without issue too.

Tried CUDA 12.3 and 12.1, on Rocky Linux and Ubuntu.

dji-transpire · 2023-11-20T23:35:47Z

The STRANGEST part is that it works beautifully on my old box, a dual Xeon DELL 7610 with two 1080TI and one M6000, but it produces only garbage on my newly built box, an ASUS X99-e WS build with two 3090 24GB Founders Edition cards.

I copied the source and recompiled with the same make LLAMA_CUBLAS=1. On the DELL with the older 1080TI and the even older M6000 24GB, the 13B Llama 2 produces nice output at pretty decent speed, but on the ASUS with two 3090 it produces garbage. It works if I take one of the 3090 out, but what is the point? I want to use the Q4 70B model.

wookayin · 2023-11-21T01:00:15Z

@dji-transpire Can you check the versions of CUDA, cuDNN, cuBLAS, the NVIDIA driver, and any other relevant SDKs? Were they the same? The GPU models (generations) already differ, so that might also be a factor even if everything else is the same.

dji-transpire · 2023-11-22T13:23:01Z

Thanks!!! You nailed it! The old box is running the 535 driver, the new box runs the latest 545 driver.

Downgrading nvidia-dkms, nvidia-utils and lib32-nvidia-utils to 535 and putting them on the IgnorePkg list solved the issue. Now both 3090 Founders Edition cards play nicely with LLAMA 13B and Q4 70B.

So: Be careful with the 545 version of the Nvidia driver and multiple GPUs????
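
A quick way to see whether a machine is on the driver branch discussed here is to query the installed version. This is a minimal sketch: the function names `is_545_branch` and `warn_if_545` are hypothetical, and treating the whole 545 branch as suspect is an assumption based only on the reports in this thread (545.23.06 and 545.29.08).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when the driver version string is in the 545 branch.
is_545_branch() {
    case "$1" in
        545.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: read one driver version per line from stdin and warn on 545.x.
warn_if_545() {
    while IFS= read -r ver; do
        if is_545_branch "$ver"; then
            echo "warning: driver $ver is in the 545 branch reported as problematic with multiple GPUs"
        fi
    done
}

# Usage on a machine with the NVIDIA driver installed (prints one line per GPU):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | warn_if_545
```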

peteygao · 2023-11-28T07:51:48Z

@dji-transpire Also running into the same issue with 3x 1080Ti, running driver version 545.29.08. Which exact version of 535 did you revert to? Was it to the latest 535.129.03?

wookayin · 2023-11-28T08:19:44Z

Note: A workaround for this bug is to use the CMake flag -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 when building llama.cpp, as done in ollama/ollama#1261. Or more simply:

make LLAMA_CUBLAS=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0

This will disable CUDA peer access completely and produce correct output when multiple GPUs are used.

Alumniminium · 2023-12-16T12:34:56Z

Is there a way to make this work in text-generation-webui without downgrading the NVIDIA drivers?

morphles · 2024-03-21T09:48:06Z

Any news on this? For dual 7900 XTX I'm still getting garbage with the hipBLAS build, regardless of model. But on a single card it works. I tried the -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 option, but since it's a CUDA option I did not have high hopes for it, and it did not help. Is there a similar variable for HIP, maybe?

slaren · 2024-03-21T14:36:02Z

https://rocm.docs.amd.com/projects/radeon/en/latest/docs/limitations.html

morphles · 2024-03-21T14:56:01Z

@slaren yeah I know that, and I have no hope of it being fixed on the AMD side soon, so I have very little hope of using PyTorch with dual cards. Yet llama.cpp is much, much better imo :) and more flexible. I already have the cards working via Vulkan; only mixtral on Vulkan is still missing, but I know 0cc4m is working on it. But even without it, I think llama.cpp already does some "manual workarounds" for what the underlying libs do not provide, so if the problem I have (one card on x8 lanes from the CPU, the other attached via the chipset) could be worked around with some slower "manual" data copying, that would still be nice :) . In any case, for now Vulkan seems like my best bet, so I'll be waiting for updates from 0cc4m :)

slaren · 2024-03-21T14:58:24Z

Somebody with access to dual 7900 XTX would need to diagnose the issue. AFAIK nobody who is working on the CUDA/HIP backend at the moment has access to this hardware.

morphles · 2024-03-21T15:13:12Z

Yeah, understandable :) For now I'm mostly happy with Vulkan, and once mixtral is supported I think I'll have basically no need for the HIP build. Still, if this somehow progresses, it would be nice to know. Thanks!

slaren · 2024-03-21T15:20:06Z

Can you test if it works with this change? (do not use -sm row).

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 04c6f5d0..06af740e 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -797,7 +797,7 @@ static ggml_backend_buffer_i ggml_backend_cuda_buffer_interface = {
     /* .init_tensor     = */ ggml_backend_cuda_buffer_init_tensor,
     /* .set_tensor      = */ ggml_backend_cuda_buffer_set_tensor,
     /* .get_tensor      = */ ggml_backend_cuda_buffer_get_tensor,
-    /* .cpy_tensor      = */ ggml_backend_cuda_buffer_cpy_tensor,
+    /* .cpy_tensor      = */ NULL,//ggml_backend_cuda_buffer_cpy_tensor,
     /* .clear           = */ ggml_backend_cuda_buffer_clear,
     /* .reset           = */ NULL,
 };
@@ -11584,7 +11584,7 @@ static ggml_backend_i ggml_backend_cuda_interface = {
     /* .get_default_buffer_type = */ ggml_backend_cuda_get_default_buffer_type,
     /* .set_tensor_async        = */ ggml_backend_cuda_set_tensor_async,
     /* .get_tensor_async        = */ ggml_backend_cuda_get_tensor_async,
-    /* .cpy_tensor_async        = */ ggml_backend_cuda_cpy_tensor_async,
+    /* .cpy_tensor_async        = */ NULL,//ggml_backend_cuda_cpy_tensor_async,
     /* .synchronize             = */ ggml_backend_cuda_synchronize,
     /* .graph_plan_create       = */ NULL,
     /* .graph_plan_free         = */ NULL,
@@ -11592,10 +11592,10 @@ static ggml_backend_i ggml_backend_cuda_interface = {
     /* .graph_compute           = */ ggml_backend_cuda_graph_compute,
     /* .supports_op             = */ ggml_backend_cuda_supports_op,
     /* .offload_op              = */ ggml_backend_cuda_offload_op,
-    /* .event_new               = */ ggml_backend_cuda_event_new,
-    /* .event_free              = */ ggml_backend_cuda_event_free,
-    /* .event_record            = */ ggml_backend_cuda_event_record,
-    /* .event_wait              = */ ggml_backend_cuda_event_wait,
+    /* .event_new               = */ NULL,//ggml_backend_cuda_event_new,
+    /* .event_free              = */ NULL,//ggml_backend_cuda_event_free,
+    /* .event_record            = */ NULL,//ggml_backend_cuda_event_record,
+    /* .event_wait              = */ NULL,//ggml_backend_cuda_event_wait,
     /* .event_synchronize       = */ ggml_backend_cuda_event_synchronize,
 };

morphles · 2024-03-21T15:31:13Z

@slaren oh wow! Rebuilt on fresh checkout with your patch, and so far I think it works, just tested with single chat with one character in SillyTavern and it seems to be generating sensible stuff (as much as one can expect from model at this time :) ). Tested on couple models, command-r Q6 and noromaid mixtral Q4_K_M. I'll try some more stuff later today, but I think you have here a winning patch! 👍

morphles · 2024-03-21T17:26:06Z

Ok testing some more generations, using mixtral, all seems to be working fine! Huge thanks @slaren !

nih23 added the bug label Oct 25, 2023

bojak83318 mentioned this issue Nov 10, 2023

Stuck loading VRAM ROCm multi gpu #3991

Closed

wookayin mentioned this issue Nov 15, 2023

garbage output on small models spread to many GPUs ollama/ollama#961

Closed

wookayin mentioned this issue Nov 24, 2023

Disable CUDA peer access as a workaround for multi-gpu inference bug ollama/ollama#1261

Merged

viktor-ferenczi mentioned this issue Nov 29, 2023

Nvidia drivers 545.29.02 broken --tensor-parallel-size vllm-project/vllm#1801

Closed

github-actions bot added the stale label Mar 19, 2024

slaren mentioned this issue Mar 21, 2024

cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy #6208

Merged

github-actions bot removed the stale label Mar 22, 2024

github-actions bot added the stale label Apr 22, 2024

github-actions bot removed the stale label May 2, 2024

etemiz mentioned this issue May 25, 2024

HIPBLAS / ROCm low prompt eval performance #7533

Closed

IMbackK closed this as completed Mar 11, 2025
