@ggerganov
Member

```cpp
throw std::runtime_error("failed to initialize memory context");
}

const uint32_t n_seqs = cparams.kv_unified ? 1 : cparams.n_seq_max;
```
Member Author

I can't recall why this `cparams.kv_unified` check was added in #14363. It seems unnecessary now, and removing it gives a better estimate of the worst-case graph.
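To make the trade-off concrete, here is a minimal C++ sketch of the reviewed line with and without the check. The `ContextParams` struct and both function names are hypothetical; only the `kv_unified` and `n_seq_max` fields come from the snippet above, and the "without check" variant is an assumption about what removing the check would look like.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the context parameters; only the two fields
// that appear in the reviewed line are modeled here.
struct ContextParams {
    bool     kv_unified;
    uint32_t n_seq_max;
};

// Reviewed expression: a unified KV cache collapses the sequence count to 1
// when sizing the worst-case graph.
uint32_t n_seqs_with_check(const ContextParams & cparams) {
    return cparams.kv_unified ? 1 : cparams.n_seq_max;
}

// Assumed replacement: always size the worst-case graph for n_seq_max
// sequences, regardless of whether the KV cache is unified.
uint32_t n_seqs_without_check(const ContextParams & cparams) {
    assert(cparams.n_seq_max >= 1 && "at least one sequence is expected");
    return cparams.n_seq_max;
}
```

For example, with `kv_unified = true` and `n_seq_max = 4`, the first function returns 1 and the second returns 4, so the worst-case graph would be reserved for four sequences instead of one.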

@ggerganov
Member Author

There are 2 remaining issues with the CI:

@slaren
Member

slaren commented Nov 19, 2025

The CUDA failure is caused by the prompt-processing (pp) and text-generation (tg) graphs being different. This happens because this model has some weights in the input layer that are loaded on the CPU but copied to CUDA for large batches. One solution would be to disable op offloading for these tests with `--no-op-offload`. I have not been able to reproduce the Vulkan issue; llvmpipe crashes on my system for some reason.
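Why the pp and tg graphs diverge can be sketched as a toy model in C++. This is not ggml's scheduler: the function names, node names and the batch threshold of 32 are illustrative assumptions. It shows that when an op's weight stays on the CPU but is copied to the GPU only for large batches, a large prompt batch and a single-token generation step produce different graphs, whereas disabling op offload makes them identical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative batch-size threshold above which an op that uses a
// CPU-resident weight is offloaded to the GPU. The value is an assumption.
constexpr int64_t kOffloadMinBatch = 32;

// Hypothetical offload decision: only large batches are offloaded, and only
// when op offloading is enabled (the role --no-op-offload plays here).
bool should_offload(int64_t n_tokens, bool op_offload) {
    assert(n_tokens > 0 && "a batch must contain at least one token");
    return op_offload && n_tokens >= kOffloadMinBatch;
}

// Simplified node list for an input-layer op whose weight lives on the CPU.
std::vector<std::string> input_layer_nodes(int64_t n_tokens, bool op_offload) {
    if (should_offload(n_tokens, op_offload)) {
        // Large batch (e.g. prompt processing): copy the weight, run on the GPU.
        return {"copy_weight_to_gpu", "op_on_gpu"};
    }
    // Small batch (e.g. text generation): run next to the weight on the CPU.
    return {"op_on_cpu"};
}
```

Under these assumptions, `input_layer_nodes(512, true)` and `input_layer_nodes(1, true)` differ, while with `op_offload = false` both calls return `{"op_on_cpu"}`, which matches the suggestion to run these tests with `--no-op-offload`.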

@jeffbolznv
Collaborator

> * The Vulkan `test-thread-safety` is triggering graph reallocation; I am not yet sure what the cause is.

Could this be the same issue I described at #17033 (comment)? I haven't had a chance to get back to this.

@ggerganov
Member Author

> Could this be the same issue I described at #17033 (comment)? I haven't had a chance to get back to this.

I can test this tomorrow to confirm.

@ggerganov
Member Author

Yes, if I disable the graph optimization logic in the Vulkan backend, `test-thread-safety` runs without causing reallocations:

```diff
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index bb3eb977c..ca12d2d1f 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -12925,6 +12925,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
 // Sort the graph for improved parallelism.
 static void ggml_vk_graph_optimize(ggml_backend_t backend, struct ggml_cgraph * graph)
 {
+    return;
     VK_LOG_DEBUG("ggml_vk_graph_optimize(" << graph->n_nodes << " nodes)");
     ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend->context;
 
```

github-actions bot added the devops label (improvements to build systems and github actions) on Nov 20, 2025
ggerganov marked this pull request as ready for review on November 20, 2025 19:54
@ggerganov
Member Author

I think we can merge this into #17276 and then figure out what to do about the remaining issue of graph optimization causing reallocations. (By the way, I'm still not sure in which cases this happens.)


Labels: devops (improvements to build systems and github actions), examples

4 participants