ggml : improve memory allocation for weights and similar lists of tensors #578

@slaren

Description

There are several patterns currently used to allocate memory for a list of fixed-size tensors, such as model weights:

  • Manually calculating the number of elements of each tensor and adding it all up
  • Creating the tensors in a no-alloc context, adding them to a list or map, or obtaining them by name from the ggml_context with ggml_get_tensor, then summing their sizes and allocating them (ggml_get_tensor does a linear scan, so the last option is $O(N^2)$)
  • Creating the tensors in a no-alloc context, then allocating the weights manually with ggml-alloc, first with a measure allocator and then again with the exact memory requirements (current llama.cpp finetune)
  • Creating the tensors in a no-alloc context, then enumerating the tensors in the context and summing their sizes (new finetune in ggml : add context enumeration functions llama.cpp#3605; see the sketch after this list)
  • Creating a ggml_context with a lot of memory and hoping for the best
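A minimal sketch of the enumeration pattern, assuming the context enumeration functions from llama.cpp#3605 (ggml_get_first_tensor/ggml_get_next_tensor); per-tensor alignment padding is omitted for brevity:

```c
#include "ggml.h"

// Sum the data sizes of all tensors created in a no-alloc context.
static size_t tensors_total_size(struct ggml_context * ctx) {
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += ggml_nbytes(t);
    }
    return total;
}
```

The context itself only needs room for the tensor metadata in this case, e.g. mem_size = ggml_tensor_overhead()*n_tensors with no_alloc = true.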

This becomes significantly more complicated when the weights have to be split between different backends (current llama.cpp and ggml-backend wip).

For something so basic, this is a lot more complicated than it should be, and we should have a standard way to do it. At the most basic level, it could simply be a function that automatically allocates all the tensors created in a no-alloc context with the exact memory requirements. Support for multiple backends will be more complicated.
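One possible shape for such a helper (the name, alignment constant, and use of malloc are illustrative, not an existing ggml API): walk the context once to compute the exact size, allocate a single buffer, then walk it again to assign each tensor's data pointer.

```c
#include "ggml.h"
#include <stdlib.h>

#define TENSOR_ALIGN 32  // assumed alignment; a real version would ask the backend

static size_t pad_to(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Hypothetical helper: allocate one buffer sized exactly for all tensors in a
// no-alloc context and point each tensor's data into it. The caller owns the
// returned buffer and must free it after freeing the context.
static void * alloc_ctx_tensors(struct ggml_context * ctx) {
    size_t total = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        total += pad_to(ggml_nbytes(t), TENSOR_ALIGN);
    }
    void * buf = malloc(total);  // an aligned allocation in a real implementation
    if (buf == NULL) {
        return NULL;
    }
    size_t offs = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        t->data = (char *) buf + offs;
        offs += pad_to(ggml_nbytes(t), TENSOR_ALIGN);
    }
    return buf;
}
```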

This could also be useful for debugging operations in compute contexts, where it might be desirable to allocate memory for every tensor in the graph so that the result of each op can be inspected later.
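For the CPU backend this is already possible by creating the compute context with no_alloc = false, so every op's result gets its own buffer in the context and nothing is overwritten; a minimal sketch with an arbitrary memory budget:

```c
#include "ggml.h"
#include <stdio.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,  // generous fixed budget (assumption)
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,         // allocate data for every tensor
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_set_f32(a, 1.0f);
    ggml_set_f32(b, 2.0f);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    // every intermediate tensor still holds its result and can be inspected
    printf("c[0] = %f\n", ggml_get_f32_1d(c, 0));

    ggml_free(ctx);
    return 0;
}
```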
