
Command R Plus crashed on large context (~40K) with CUDA #6948

Closed

Description

@TomoshibiAkira

I tested Command R Plus on 4 L20 cards with a maximum context of 64K, with all 64 layers offloaded to GPU (16 layers per card).
My prompt is relatively large, around 50K tokens. During the prefill phase, llama.cpp crashed at ~40K tokens.

Here's the error message:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2403
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Aborted (core dumped)

I'm using @dranger003's Q6_K model with the perplexity test fix from #6491 applied.
I also tested with a 32K context and it works fine.
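
For reference, here is a minimal sketch of how this setup can be reproduced with llama-cpp-python. The model path, prompt file, and even tensor split are assumptions based on the description above (4 GPUs, 64 offloaded layers, 64K context), not the exact script I ran:

from llama_cpp import Llama

# Configuration as described above (paths are placeholders):
# 4x L20, all 64 layers offloaded, split evenly across the cards, 64K context.
llm = Llama(
    model_path="command-r-plus-Q6_K.gguf",  # placeholder path to the Q6_K GGUF
    n_ctx=65536,                            # 64K context
    n_gpu_layers=64,                        # offload all 64 layers
    tensor_split=[0.25, 0.25, 0.25, 0.25],  # ~16 layers per card
)

with open("prompt.txt") as f:               # placeholder ~50K-token prompt
    prompt = f.read()

# The crash occurs during prefill, around the 40K-token mark.
out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])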
