llama : run all KQV ops on the CPU with no KV offload #5049

slaren · 2024-01-20T13:18:58Z

This change restores the earlier behavior: when the KV cache is not offloaded, all attention ops run on the CPU. As a result, much less data has to be transferred between the CPU and GPU. Performance should improve for large contexts, but prompt/batch performance may be worse for small contexts.

This should also fix nkvo with CUDA, but the underlying issue is still not known.
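
To make the transfer trade-off above concrete, here is a back-of-the-envelope sketch in C++. It is not taken from the llama.cpp sources and does not describe its internals. The struct `ModelShape`, both helper functions, and the example sizes are hypothetical, and the "four activation tensors per layer" figure (Q, K and V for the current tokens going to the CPU, attention output coming back) is a simplifying assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model shape. Every value here is an illustrative assumption.
struct ModelShape {
    std::int64_t n_embd;     // width of one activation row
    std::int64_t n_embd_kv;  // width of one K (or V) row per token, per layer
    std::int64_t bytes_kv;   // bytes per KV cache element (e.g. 2 for f16)
    std::int64_t bytes_act;  // bytes per activation element (e.g. 4 for f32)
};

// Rough per-layer bytes moved if the K and V data stay in host memory but
// must be copied to the GPU for the attention ops: all n_kv cached rows of
// both K and V.
std::int64_t kv_transfer_bytes(const ModelShape & m, std::int64_t n_kv) {
    assert(n_kv >= 0);
    return 2 * n_kv * m.n_embd_kv * m.bytes_kv;
}

// Rough per-layer bytes moved if attention runs on the CPU instead:
// assumed to be four activation tensors of n_tokens rows each
// (Q, K, V for the current batch to the CPU, attention output back).
std::int64_t act_transfer_bytes(const ModelShape & m, std::int64_t n_tokens) {
    assert(n_tokens >= 0);
    return 4 * n_tokens * m.n_embd * m.bytes_act;
}
```

Under these assumptions, with `ModelShape{4096, 4096, 2, 4}`, decoding a single token at a 32768-token context gives 536,870,912 bytes of KV data per layer against 65,536 bytes of activations. For a 512-token batch at a 512-token context the estimate flips (8,388,608 bytes of KV data against 33,554,432 bytes of activations), so the transfer saving only appears once the context is large relative to the batch. The sketch counts transfers only and ignores how fast each device computes attention.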

ggml-ci

JianbangZ · 2024-01-22T17:51:00Z

@slaren As I mentioned elsewhere, this commit broke some things; see #5082.

ggml-ci

llama : run all KQV ops on the CPU with no KV offload

16b7e83

ggml-ci

ggerganov approved these changes Jan 20, 2024


ggerganov merged commit 6df465a into master Jan 20, 2024

slaren deleted the sl/nkvo-fix branch January 20, 2024 15:16

slaren mentioned this pull request Jan 20, 2024

Token generation broken on CUDA when offload_kqv is false #4991

Closed

JianbangZ mentioned this pull request Jan 22, 2024

[not enough space in the buffer error] Qwen model long prompt #5082

Closed

crasm pushed a commit that referenced this pull request Jan 23, 2024

llama : run all KQV ops on the CPU with no KV offload (#5049)

2cb253c

ggml-ci

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024

llama : run all KQV ops on the CPU with no KV offload (ggml-org#5049)

0baa23f

ggml-ci

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

llama : run all KQV ops on the CPU with no KV offload (ggml-org#5049)

ca740f3

ggml-ci
