Bug: KV cache load/save is slow #8915
Labels
bug-unconfirmed
medium severity
stale
What happened?
I wrote a KV cache cache, and then benchmarked it. `llama_state_seq_get_size`, `llama_state_seq_get_data`, and `llama_state_seq_set_data` are slow enough that it is significantly (13x) better to just start over from nothing each time.

However, from looking through the code, I think there is opportunity to improve quite a lot. (It is unclear to me whether these improvements will be sufficient to make it worth managing an external cache, but in theory I think it ought to be possible.)
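For reference, the call pattern being benchmarked looks roughly like the sketch below. This is illustrative only: the exact parameter lists of the `llama_state_seq_*` functions vary between llama.cpp versions, and `save_seq_state` / `load_seq_state` are hypothetical wrapper names, not part of the API.

```cpp
// Minimal sketch of saving/restoring one sequence's KV cache state via the
// llama_state_seq_* API. Parameter lists are illustrative and may differ
// slightly depending on the llama.cpp version.
#include <cstdint>
#include <vector>
#include "llama.h"

// Save sequence `seq` into a caller-owned buffer.
std::vector<uint8_t> save_seq_state(llama_context * ctx, llama_seq_id seq) {
    // Today this call already does a full GPU -> host copy just to report the size.
    const size_t size = llama_state_seq_get_size(ctx, seq);
    std::vector<uint8_t> buf(size);
    // ...and this call repeats the copy to actually fill the buffer.
    llama_state_seq_get_data(ctx, buf.data(), seq);
    return buf;
}

// Restore a previously saved buffer into sequence `seq`.
void load_seq_state(llama_context * ctx, llama_seq_id seq, const std::vector<uint8_t> & buf) {
    llama_state_seq_set_data(ctx, buf.data(), seq);
}
```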
Here are a few observations, starting with just the `get` APIs...

- `llama_state_seq_get_size` does a full copy from the GPU and throws it away. (My cache management implementation is in Go, so for GC/allocator reasons, I need the size up front.)
- In `write_kv_cache_data`, we have lots of double-copying, from GPU to staging area and then staging area to destination (for example, the pattern sketched after this list). An extremely crude benchmark suggests that this double-copy is ~5% of the runtime of `llama_state_seq_get_data`.
- We call `ggml_backend_tensor_get` a lot of times. In the case in which the tensors are contiguous, it would probably be significantly faster to do a single transfer. A back-of-the-envelope calculation about PCIe data transfer rates suggests that we are nowhere near saturating the bus, and there is very little computation going on, which points to per-transfer latency overhead as a major culprit.

I'm using an RTX 4090 with a server-grade motherboard.
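To make the double-copy concrete, here is a paraphrase of the pattern (not a verbatim quote of `write_kv_cache_data`; `k`, `range_offset`, `range_size`, `write`, `dst`, and `dst_offset` are stand-in names), followed by the kind of single direct transfer that could replace it when both the destination buffer and the tensor range are contiguous:

```cpp
// Current pattern (paraphrased): GPU -> staging buffer -> destination, per slice.
std::vector<uint8_t> tmp_buf(range_size);
ggml_backend_tensor_get(k, tmp_buf.data(), range_offset, range_size); // GPU -> staging
write(tmp_buf.data(), tmp_buf.size());                                // staging -> destination

// Possible alternative when the output buffer is already allocated and the
// ranges are contiguous: one transfer straight into the destination.
ggml_backend_tensor_get(k, dst + dst_offset, range_offset, range_size);
dst_offset += range_size;
```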
cc @abetlen
cc @slaren (per suggestion of @abetlen)
Name and Version
$ ./llama-cli --version
version: 3488 (75af08c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response