llama : fix not enough space in buffer with Qwen #5086
Merged
Fixes #5082
This was caused by a minor reordering of the nodes, which made the measured compute buffer size inaccurate. Changing the order of the nodes fixes the issue for all the models I could test. On that note, it would be very useful to have a directory with links to gguf files of all the base models supported by llama.cpp.
Ultimately, I think that the current approach of ggml-alloc is always going to be susceptible to these issues: small changes in the size of a tensor can cause the following tensors to be allocated in a different block than during measure, producing a different fragmentation pattern that leads to out-of-memory errors.
In the long term, a more robust solution is needed, such as always assigning each tensor the same offset within the buffer regardless of its size; then allocation would always succeed as long as no tensor is larger than it was during measure. This should also make ggml-alloc faster during inference, since the whole allocation process could be skipped and the allocations obtained during measure simply reused. It might additionally allow a more exhaustive search for an optimal tensor layout, since that search would only happen once during initialization.
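A minimal sketch of that idea, with hypothetical structures and names (not the real ggml-alloc API): the measure pass records a fixed offset and a worst-case size per tensor, and inference just looks the offset up instead of allocating:

```c
#include <assert.h>
#include <stddef.h>

// Sketch of the proposed fixed-offset approach. All names here are
// hypothetical; this is not the real ggml-alloc API.
#define MAX_TENSORS 16

typedef struct {
    size_t offset[MAX_TENSORS];   // stable offset per tensor, set at measure time
    size_t max_size[MAX_TENSORS]; // worst-case size seen during measure
    size_t buffer_size;           // total buffer required
    int    n_tensors;
} fixed_plan;

// measure pass: append each tensor at the end of the buffer and
// remember its offset; returns the tensor's id in the plan
static int plan_tensor(fixed_plan *p, size_t size) {
    int id = p->n_tensors++;
    p->offset[id]   = p->buffer_size;
    p->max_size[id] = size;
    p->buffer_size += size;
    return id;
}

// inference: no allocation at all, just reuse the planned offset;
// valid as long as the tensor is no larger than during measure
static size_t tensor_offset(const fixed_plan *p, int id, size_t size) {
    assert(size <= p->max_size[id]);
    return p->offset[id];
}
```

This naive version just appends tensors end to end; a real implementation could pack them far more tightly by overlapping tensors with disjoint lifetimes, which is exactly where a more exhaustive search during measure would pay off.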