llama : update worst-case graph for unified cache #17379
base: sl/realloc-error
Conversation
```cpp
        throw std::runtime_error("failed to initialize memory context");
    }

    const uint32_t n_seqs = cparams.kv_unified ? 1 : cparams.n_seq_max;
```
I can't recall why this cparams.kv_unified check was added in #14363. It seems unnecessary now, and removing it gives a better estimate of the worst-case graph.
There are two remaining issues with the CI:
The CUDA failure is caused by different pp and tg graphs. This happens because this model has some weights in the input layer that are loaded on the CPU but copied to CUDA with large batches. A solution to that would be to disable op offloading for these tests with
Could this be the same issue I described at #17033 (comment)? I haven't had a chance to get back to this.

I can test this tomorrow to confirm.
Yes, if I disable the graph optimization logic in the vulkan backend:

```diff
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index bb3eb977c..ca12d2d1f 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -12925,6 +12925,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
 // Sort the graph for improved parallelism.
 static void ggml_vk_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * graph)
 {
+    return;
     VK_LOG_DEBUG("ggml_vk_graph_optimize(" << graph->n_nodes << " nodes)");
     ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend->context;
```
Think we can merge this into #17276 and then figure out what to do with the remaining issue of graph optimization causing reallocations. (btw, I'm still not sure I understand in which cases this happens)
target #17276
Fixes https://github.com/ggml-org/llama.cpp/actions/runs/19443607071/job/55632677520?pr=17276#step:3:4553