cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy #6208

slaren · 2024-03-21T19:00:17Z

Adds the build flag LLAMA_CUDA_NO_PEER_COPY to disable peer-to-peer copies, which causes ggml-backend to fall back to copying through the CPU. This also disables pipeline parallelism.
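For readers unfamiliar with the pattern, here is a rough sketch of how a compile-time switch like this can force the fallback path. It is plain C++ with no GPU calls, and it is not the code in this PR: DeviceBuffer, try_peer_copy, copy_via_host and copy_tensor are hypothetical names made up for illustration. Only the GGML_CUDA_NO_PEER_COPY define comes from this thread.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-in for memory owned by one GPU; not a ggml type.
struct DeviceBuffer {
    int device;
    std::vector<unsigned char> data;
};

// Attempt a direct device-to-device copy.
// Returning false tells the caller to use its fallback path.
bool try_peer_copy(const DeviceBuffer & src, DeviceBuffer & dst) {
#ifdef GGML_CUDA_NO_PEER_COPY
    (void) src;
    (void) dst;
    return false; // peer copies are compiled out entirely
#else
    if (src.data.size() != dst.data.size()) {
        return false;
    }
    // In a real backend this would be a peer copy between devices.
    std::memcpy(dst.data.data(), src.data.data(), src.data.size());
    return true;
#endif
}

// Fallback: stage the data in host memory, then upload it to the target.
void copy_via_host(const DeviceBuffer & src, DeviceBuffer & dst) {
    std::vector<unsigned char> host(src.data.begin(), src.data.end()); // device -> host
    dst.data.assign(host.begin(), host.end());                         // host -> device
    assert(dst.data == src.data);
}

// Returns true if the direct path was used, false if the host path was used.
bool copy_tensor(const DeviceBuffer & src, DeviceBuffer & dst) {
    if (try_peer_copy(src, dst)) {
        return true;
    }
    copy_via_host(src, dst);
    return false;
}
```

Compiling this sketch with -DGGML_CUDA_NO_PEER_COPY makes copy_tensor always take the host path and return false; without the define it uses the direct copy.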

Ref: #3772 (comment)

@morphles can you check if building this PR with LLAMA_CUDA_NO_PEER_COPY also fixes your issue? It should have the same effect as the patch that you tested previously.

morphles · 2024-03-21T19:40:39Z

I think something is not right. First, if I build with make LLAMA_HIPBLAS=1 LLAMA_CUDA_NO_PEER_COPY=1, it somehow does not set up NVCCFLAGS as it should (at least I do not see -DGGML_CUDA_NO_PEER_COPY in the build lines), but maybe I'm using make wrong here? In the end the model loads, but it still produces nonsense, as on the base build.

If I build with cmake like this:

 CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -H. -Bbuild -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_NO_PEER_COPY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16

I get a failing assertion:

llama_new_context_with_model: graph nodes  = 1668
llama_new_context_with_model: graph splits = 3
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_op_flatten at /home/morphles/n/customized_llama/PR_check/llama.cpp/ggml-cuda.cu:9960
  hipGetLastError()
GGML_ASSERT: /home/morphles/n/customized_llama/PR_check/llama.cpp/ggml-cuda.cu:193: !"CUDA error"

Still, maybe I'm using cmake wrong too?

slaren · 2024-03-21T19:52:05Z

@morphles this should be fixed now. I didn't realize that the HIP build handles these flags separately.

morphles · 2024-03-21T20:13:21Z

Now it seems to be good; I tested both make and cmake as above. Amazing, and huge thanks!

cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy

cfbf76b

add LLAMA_CUDA_NO_PEER_COPY to HIP build

01708f7

ggerganov approved these changes Mar 22, 2024

slaren merged commit 2f0e81e into master Mar 22, 2024

slaren deleted the sl/rocm-radeon-multi-gpu-workaround branch March 22, 2024 13:05

slaren mentioned this pull request Apr 7, 2024

Multi-GPU support for AMD? #3051

Closed

aymane-eljerari mentioned this pull request Jun 18, 2024

Bug: Llama3 8B Instruct Model outputting nonsensical text on AMD GPUs. #7984

Closed
