
llama : fix not enough space in buffer with Qwen #5086


Merged 1 commit into master on Jan 22, 2024
Conversation

@slaren (Member) commented on Jan 22, 2024

Fixes #5082

This was caused by a minor reordering of the nodes, which made the measured compute buffer size inaccurate. Changing the order of the nodes fixes the issue for all the models I could test. On that note, it would be very useful to have a directory with links to gguf files of all the base models supported by llama.cpp.

Ultimately, I think the current approach in ggml-alloc will always be susceptible to these issues: a small change in the size of one tensor can cause the following tensors to be allocated in different blocks than during measure, producing a different fragmentation pattern that leads to out-of-memory errors.
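
To illustrate the failure mode (a simplified sketch for discussion, not the actual ggml-alloc code): with a first-fit allocator, a tensor that is slightly larger at inference than during measure can land in a different free block, and every allocation after it then sees a different fragmentation pattern.

```c
// Simplified first-fit allocator, for illustration only. It shows how a
// small size change in one tensor shifts later tensors into different
// free blocks than the ones used during measure.
#include <stddef.h>
#include <stdio.h>

#define MAX_BLOCKS 16

struct free_block { size_t offset, size; };

static struct free_block blocks[MAX_BLOCKS];
static int n_blocks;

// Return the offset of the first free block large enough, or (size_t)-1.
static size_t alloc_first_fit(size_t size) {
    for (int i = 0; i < n_blocks; i++) {
        if (blocks[i].size >= size) {
            size_t offset = blocks[i].offset;
            blocks[i].offset += size;
            blocks[i].size   -= size;
            return offset;
        }
    }
    return (size_t)-1; // the "not enough space in the buffer" case
}

int main(void) {
    // Free space as it looked at this point during measure:
    // a 100-byte block and a 300-byte block.
    blocks[0] = (struct free_block){   0, 100 };
    blocks[1] = (struct free_block){ 200, 300 };
    n_blocks  = 2;

    // During measure, tensor A was 96 bytes and fit in block 0, leaving
    // block 1 free for tensor B (250 bytes). At inference, A is 104 bytes,
    // falls through to block 1, and B no longer fits anywhere.
    size_t a = alloc_first_fit(104);
    size_t b = alloc_first_fit(250);

    printf("A at offset %zu\n", a);
    if (b == (size_t)-1) {
        printf("B does not fit: the measured layout no longer holds\n");
    }
    return 0;
}
```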

In the long term, a more robust solution is needed, such as always assigning the same offset within the buffer to each tensor, regardless of its size; then allocation would always succeed as long as no tensor is ever larger than it was during measure. This would also make ggml-alloc faster during inference, since we could skip the allocation process entirely and simply reuse the allocations obtained during measure. It might additionally allow a more exhaustive search for a more optimal tensor layout during measure, since that search would only happen once, during initialization.
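
A minimal sketch of that idea, with hypothetical names (`tensor_alloc` and `alloc_from_plan` are not real ggml-alloc APIs): measure records one offset per tensor, and inference replays the plan with nothing more than a size check.

```c
// Sketch of the proposed fixed-offset scheme, assuming a per-tensor plan
// recorded during measure. Inference never searches a free list; it only
// verifies that the tensor did not grow past its measured size.
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct tensor_alloc {
    size_t offset;   // offset assigned to this tensor during measure
    size_t max_size; // size the tensor had during measure
};

// Inference-time "allocation": no free-list search, just replay the plan.
static size_t alloc_from_plan(const struct tensor_alloc *plan, size_t size) {
    // Holds as long as the tensor is never larger than during measure.
    assert(size <= plan->max_size);
    return plan->offset;
}

int main(void) {
    // Plan recorded during measure: one entry per tensor in the graph.
    struct tensor_alloc plan[] = {
        { .offset =   0, .max_size = 128 },
        { .offset = 128, .max_size = 256 },
    };

    // At inference the tensors may be smaller (e.g. a shorter prompt),
    // but they always land at the same offsets, so the fragmentation
    // pattern cannot differ from the measured one.
    printf("t0 at %zu, t1 at %zu\n",
           alloc_from_plan(&plan[0], 96),
           alloc_from_plan(&plan[1], 256));
    return 0;
}
```

Since the offsets never move, the per-graph allocation cost drops to a bounds check, and the measure step is free to spend extra time searching for a tighter packing.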

@ggerganov (Member)

> On that note, it would be very useful to have a directory with links to gguf files of all the base models supported by llama.cpp.

Do you mean something like a text file with HF links? We can do that.

@slaren (Member, Author) commented on Jan 22, 2024

Yes, in a file somewhere, or even just a page in the wiki; I just want a list of models that I can download to test. As it is, I don't even know where to find half of the supported models.

@slaren merged commit 011e8ec into master on Jan 22, 2024
@slaren deleted the sl/qwen-fix branch on January 22, 2024 at 22:42
@ggerganov (Member)

> Yes, in a file somewhere, or even just a page in the wiki; I just want a list of models that I can download to test. As it is, I don't even know where to find half of the supported models.

Started making the list here: #5141

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Successfully merging this pull request may close these issues:

[not enough space in the buffer error] Qwen model long prompt