Description
I tested Command R Plus on 4 NVIDIA L20 cards with a maximum context of 64K, with 64 layers offloaded to the GPUs (16 layers per card).
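For reference, the model is loaded through llama-cpp-python roughly as in the sketch below (the model filename is a placeholder, not my exact path):

```python
from llama_cpp import Llama

# Rough sketch of the setup that triggers the crash.
llm = Llama(
    model_path="ggml-c4ai-command-r-plus-q6_k.gguf",  # placeholder filename
    n_ctx=65536,                    # 64K context
    n_gpu_layers=64,                # offload 64 layers to GPU
    tensor_split=[1, 1, 1, 1],      # even split across the 4 L20 cards, i.e. 16 layers each
)
```

The equivalent llama.cpp CLI flags would be roughly `-c 65536 -ngl 64 -ts 1,1,1,1`.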
My prompt is fairly large, around 50K tokens. During the prefill phase, llama.cpp crashed at ~40K tokens.
Here's the error message:
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:2403
cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: /root/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Aborted (core dumped)
I'm using @dranger003's Q6_K model with the perplexity test fix from #6491 applied.
I also tested with a 32K context, and that works fine.