Bug: having more than one context doesn't work as expected with the Vulkan backend #7575
Comments
Thank you @0cc4m!
@0cc4m I've tested the latest release, and decoding now works well with more than one context 🚀

This issue can be replicated with this code:

// Assumes the llama.cpp headers; llama_tokenize and llama_batch_add are the
// helper overloads from common.h.
#include "llama.h"
#include "common.h"

#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

// Tokenize the text, decode it in the given context, and print the resulting embedding.
void embed_text(const char * text, llama_model * model, llama_context * context) {
    std::vector<llama_token> tokens = llama_tokenize(model, text, false, false);
    auto n_tokens = tokens.size();

    auto batch = llama_batch_init(n_tokens, 0, 1);
    for (size_t i = 0; i < n_tokens; i++) {
        llama_batch_add(batch, tokens[i], i, { 0 }, false);
    }
    // Request output for the last token only.
    batch.logits[batch.n_tokens - 1] = true;

    llama_decode(context, batch);
    llama_synchronize(context);

    const int n_embd = llama_n_embd(model);

    // Prefer the pooled sequence embedding; fall back to the last token's embedding.
    const auto * embeddings = llama_get_embeddings_seq(context, 0);
    if (embeddings == NULL) {
        embeddings = llama_get_embeddings_ith(context, tokens.size() - 1);
        if (embeddings == NULL) {
            printf("Failed to get embedding\n");
        }
    }

    if (embeddings != NULL) {
        printf("Embeddings: ");
        for (int i = 0; i < n_embd; ++i) {
            printf("%f ", embeddings[i]);
        }
        printf("\n");
    }

    llama_batch_free(batch);
}
int main() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;

    auto model_path = "/home/user/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf";
    auto model = llama_load_model_from_file(model_path, model_params);

    auto text1 = "Hi there";
    auto text2 = "Hello there";

    auto context_params = llama_context_default_params();
    context_params.embeddings      = true;
    context_params.seed            = time(NULL);
    context_params.n_ctx           = 4096;
    context_params.n_threads       = 6;
    context_params.n_threads_batch = context_params.n_threads;
    context_params.n_batch         = 512;
    context_params.n_ubatch        = 512;

    // Two contexts created from the same model and kept alive at the same time.
    auto context1 = llama_new_context_with_model(model, context_params);
    auto context2 = llama_new_context_with_model(model, context_params);

    // one of these threads causes the process to crash
    std::thread thread1(embed_text, text1, model, context1);
    std::thread thread2(embed_text, text2, model, context2);
    thread1.join();
    thread2.join();

    llama_free(context1);
    llama_free(context2);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?

There seems to be some kind of memory overlap between contexts created from the same model with the Vulkan backend when the contexts are loaded at the same time. Freeing the first context before creating the second one works as expected, though (see the sketch below). Other backends support having multiple contexts at the same time, so I think Vulkan should support it, too.

The reproduction code above crashes with signal SIGSEGV, Segmentation fault. Running it under gdb shows this stack trace:

I've used this model in the code above.
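For reference, a minimal sketch of the sequential workaround mentioned above: each context is freed before the next one is created, so only one context exists at a time. It reuses the embed_text() helper, model path, and parameter values from the reproduction code; whether this avoids the crash beyond the setup described here is an assumption based on this report.

int main() {
    llama_backend_init();

    auto model_params = llama_model_default_params();
    model_params.n_gpu_layers = 33;
    auto model = llama_load_model_from_file("/home/user/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", model_params);

    auto context_params = llama_context_default_params();
    context_params.embeddings = true;
    context_params.n_ctx      = 4096;

    // First context: create, embed, free before touching the second one.
    auto context1 = llama_new_context_with_model(model, context_params);
    embed_text("Hi there", model, context1);
    llama_free(context1);

    // Second context is only created after the first has been freed.
    auto context2 = llama_new_context_with_model(model, context_params);
    embed_text("Hello there", model, context2);
    llama_free(context2);

    llama_free_model(model);
    llama_backend_free();
    return 0;
}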
Name and Version
I tested the above code with release b3012.

What operating system are you seeing the problem on?
Linux
Relevant log output