Description
I tested Command R Plus on 4 NVIDIA L20 cards with a maximum context of 64K, with 64 layers offloaded to the GPUs (16 layers per card).
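For reference, the model is loaded through llama-cpp-python roughly as in the sketch below (the model filename is a placeholder, not my exact path):

```python
from llama_cpp import Llama

# Rough sketch of the setup that triggers the crash.
llm = Llama(
    model_path="ggml-c4ai-command-r-plus-q6_k.gguf",  # placeholder filename
    n_ctx=65536,                    # 64K context
    n_gpu_layers=64,                # offload 64 layers to GPU
    tensor_split=[1, 1, 1, 1],      # even split across the 4 L20 cards, i.e. 16 layers each
)
```

The equivalent llama.cpp CLI flags would be roughly `-c 65536 -ngl 64 -ts 1,1,1,1`.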
My prompt is fairly large, around 50K tokens. During the prefill phase, llama.cpp crashed at ~40K tokens.
Here's the error message:
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2403
cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Aborted (core dumped)
I'm using @dranger003's Q6_K model with the perplexity test fix from #6491 applied.
I also tested with a 32K context, and that works fine.