llama : fix not enough space in buffer with Qwen #5086
Merged
Fixes #5082
This was caused by a minor reordering of the nodes, which made the measured compute buffer size inaccurate. Changing the order of the nodes fixes the issue for all the models I could test. On that note, it would be very useful to have a directory with links to gguf files of all the base models supported by llama.cpp.
Ultimately, I think that the current approach of ggml-alloc is always going to be susceptible to these issues: small changes in the size of a tensor can cause the following tensors to be allocated in a different block than during measure, producing a different fragmentation pattern that leads to out-of-memory errors.
In the long term, a more robust solution is needed, such as always assigning each tensor the same offset within the buffer regardless of its size; then allocation would always succeed as long as no tensor is larger than it was during measure. This should also make ggml-alloc faster during inference, since the whole allocation process could be skipped and the allocations obtained during measure simply reused. It might additionally allow a more exhaustive search for an optimal tensor layout, since that search would only happen once during initialization.
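A minimal sketch of that idea, with hypothetical structures and names (not the real ggml-alloc API): the measure pass records a fixed offset and a worst-case size per tensor, and inference just looks the offset up instead of allocating:

```c
#include <assert.h>
#include <stddef.h>

// Sketch of the proposed fixed-offset approach. All names here are
// hypothetical; this is not the real ggml-alloc API.
#define MAX_TENSORS 16

typedef struct {
    size_t offset[MAX_TENSORS];   // stable offset per tensor, set at measure time
    size_t max_size[MAX_TENSORS]; // worst-case size seen during measure
    size_t buffer_size;           // total buffer required
    int    n_tensors;
} fixed_plan;

// measure pass: append each tensor at the end of the buffer and
// remember its offset; returns the tensor's id in the plan
static int plan_tensor(fixed_plan *p, size_t size) {
    int id = p->n_tensors++;
    p->offset[id]   = p->buffer_size;
    p->max_size[id] = size;
    p->buffer_size += size;
    return id;
}

// inference: no allocation at all, just reuse the planned offset;
// valid as long as the tensor is no larger than during measure
static size_t tensor_offset(const fixed_plan *p, int id, size_t size) {
    assert(size <= p->max_size[id]);
    return p->offset[id];
}
```

This naive version just appends tensors end to end; a real implementation could pack them far more tightly by overlapping tensors with disjoint lifetimes, which is exactly where a more exhaustive search during measure would pay off.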