Name and Version
This appears to be the same bug as noted in this issue:
#7575
We are trying to run inference from multiple threads, with some contexts having LoRAs loaded and others not (so batched inference isn't an option). If I may ask, has there been any progress on this issue? We are currently using a build from mid-September 2024.
Operating systems
Windows
GGML backends
Vulkan
Hardware
2x Nvidia RTX 3090s.
Models
Meta Llama 3.2 3B, 8-bit quant.
Problem description & steps to reproduce
When we call llama_decode on different contexts from different threads, we get a crash. The only workaround we have found is to strictly serialize access to llama_decode and LoRA loading via a mutex.
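For reference, a minimal sketch of the mutex workaround we are using is below. The wrapper name guarded_decode and the global g_decode_mutex are our own (hypothetical) names, not part of llama.cpp; the point is simply that every llama_decode call and every LoRA adapter load, across all threads and contexts, goes through one lock.

```cpp
// Sketch of our workaround (hypothetical wrapper, not part of llama.cpp).
// Serializing all llama_decode calls and LoRA adapter loads behind a single
// global mutex avoids the crash in vkQueueSubmit.
#include <mutex>

#include "llama.h"

static std::mutex g_decode_mutex;

// Drop-in replacement for direct llama_decode calls from worker threads.
int32_t guarded_decode(llama_context * ctx, llama_batch batch) {
    std::lock_guard<std::mutex> lock(g_decode_mutex);
    return llama_decode(ctx, batch);
}

// The same lock must also be held while loading/attaching LoRA adapters
// (e.g. llama_lora_adapter_init / llama_lora_adapter_set in builds from
// that era -- exact function names vary by version).
```

Without the lock, two threads submitting work for different contexts at the same time is enough to reproduce the crash on our setup.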
First Bad Commit
No response
Relevant log output
It appears to be an error in vkQueueSubmit, line 1101.