
Recommended ways to conserve memory - scratch buffers / graph allocator #1001

@Tianyue-Zhao

I'm implementing a visual transformer with GGML on the CPU backend and seeing very high memory usage when calling ggml_backend_graph_compute, so I opened this issue to ask about the recommended ways to conserve memory.

My visual transformer has 24 transformer layers in total, and all computation and weights are GGML_TYPE_F32.
The input image shape is 1120 x 1120 x 3 x 1, and the hidden state passed through each transformer layer has shape 6401 x 1024 x 1.
With all 24 layers defined in the graph, calling ggml_backend_graph_compute exhausts all 128 GB of RAM on my machine, and even that is not enough. Before the compute call, memory usage is only about 3 GB; it then grows rapidly while ggml_backend_graph_compute is running.

I am able to run the transformer and get correct results if I only define a few transformer layers. It looks like the memory used to store the results of earlier layers is not being reused for later layers, so memory usage grows roughly linearly with the number of layers. Within each layer, the main culprit is the attention mechanism, which produces several 6400 x 6400 x 16 tensors.
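As a rough back-of-the-envelope check (my own estimate, not measured): a single 6400 x 6400 x 16 tensor in F32 is about 6400 * 6400 * 16 * 4 bytes ≈ 2.6 GB, and with at least two of them per layer (the MUL_MAT and SOFT_MAX outputs in the print-out below), 24 layers would need on the order of 48 * 2.6 GB ≈ 126 GB if none of that memory is reused, which lines up with the 128 GB machine running out.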

I read that with scratch buffers I could allocate the tensors of each transformer layer at the exact same location, so that RAM would only hold the results of two layers at a time instead of all 24. However, I also read in this discussion (graph allocator) that scratch buffers have been replaced by the graph allocator. Is using scratch buffers still the recommended way to conserve memory today?

My code currently uses one context for the weights and one context for computation. Both contexts are created with the no_alloc flag set, and backend buffers are then allocated with ggml_backend_alloc_ctx_tensors.

Is it correct that the graph allocator should automatically reuse the space allocated for earlier layers of a transformer? If so, what might I be doing wrong that prevents it from doing so?
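For reference, this is the allocation pattern I pieced together from the simple backend example. It is only a minimal sketch of my current understanding (build_graph here is a placeholder for my cv_graph function, and the weight context is omitted), so it may well be where my mistake is:

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Placeholder for my cv_graph: builds the forward graph inside ctx_compute.
struct ggml_cgraph * build_graph(struct ggml_context * ctx_compute);

void run_graph(ggml_backend_t backend) {
    // The compute context only holds tensor/graph metadata (no_alloc = true),
    // so no tensor data lives inside the context itself.
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx_compute = ggml_init(params);

    struct ggml_cgraph * gf = build_graph(ctx_compute);

    // My understanding: the graph allocator plans address reuse across the whole
    // graph, so the buffer it reserves should reflect the peak working set,
    // not the sum of every intermediate tensor.
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(allocr, gf);  // reserves the buffer if needed and assigns tensor addresses

    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(allocr);
    ggml_free(ctx_compute);
}

The main difference from my main() below is that ggml_backend_alloc_ctx_tensors is never called on the compute context here; I'm not sure whether that call is required, redundant, or something that prevents the allocator from reusing space.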

The graph information is as follows. Each transformer layer has 79 nodes, and there are 19 nodes outside of the transformer layers.

The full graph with all 24 transformer layers has 1915 nodes and is too long to post here.
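(That is consistent with the per-layer count: 24 * 79 + 19 = 1915.)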

Graph print-out with 1 transformer layer, 98 nodes total
=== GRAPH ===
n_nodes = 98
 -   0: [  1024,     1,     1]           REPEAT  
 -   1: [   588,    80,    80]           IM2COL  
 -   2: [   588,  6400,     1]          RESHAPE  
 -   3: [   588,  1024,     1]          RESHAPE  
 -   4: [  6400,  1024,     1]          MUL_MAT  
 -   5: [    80,    80,     1]          RESHAPE  
 -   6: [    80,    80,  1024]          PERMUTE  
 -   7: [    80,    80,  1024]             CONT  
 -   8: [     1,     1,  1024]          RESHAPE  
 -   9: [    80,    80,  1024]              ADD  
 -  10: [  6400,  1024,     1]          RESHAPE  
 -  11: [  1024,  6400,     1]        TRANSPOSE  
 -  12: [  1024,  6400,     1]             CONT  
 -  13: [  1024,  6401,     1]           CONCAT  
 -  14: [  1024,  6401,     1]              ADD  
 -  15: [  1024,  6401,     1]             NORM  
 -  16: [  1024,  6401,     1]              MUL  
 -  17: [  1024,  6401,     1]              ADD  
 -  18: [  1024,  6401,     1]          MUL_MAT  
 -  19: [  1024,  6401,     1]              ADD  
 -  20: [    64,    16,  6401]          RESHAPE  
 -  21: [  6401,    64,    16]          PERMUTE  
 -  22: [  6401,    64,    16]             CONT  
 -  23: [  1024,  6401,     1]          MUL_MAT  
 -  24: [    64,    16,  6401]          RESHAPE  
 -  25: [    64,  6401,    16]          PERMUTE  
 -  26: [    64,  6401,    16]             CONT  
 -  27: [    64,     1,    16]             VIEW  
 -  28: [    64,  6400,    16]             VIEW  
 -  29: [    64,  6400,    16]             CONT  
 -  30: [    64,  6400,    16]              MUL  
 -  31: [     2,    32,  6400]          RESHAPE  
 -  32: [    16,    32,  6400]          PERMUTE  
 -  33: [    16,    32,  6400]             CONT  
 -  34: [    16,    32,  6400]             VIEW  
 -  35: [    16,    32,  6400]            SCALE  
 -  36: [    16,    32,  6400]             VIEW  
 -  37: [    16,    32,  6400]           CONCAT  
 -  38: [     2,    32,  6400]          PERMUTE  
 -  39: [     2,    32,  6400]             CONT  
 -  40: [    64,  6400,    16]          RESHAPE  
 -  41: [    64,  6400,    16]              MUL  
 -  42: [    64,  6400,    16]              ADD  
 -  43: [    64,  6401,    16]           CONCAT  
 -  44: [  1024,  6401,     1]          MUL_MAT  
 -  45: [  1024,  6401,     1]              ADD  
 -  46: [    64,    16,  6401]          RESHAPE  
 -  47: [    64,  6401,    16]          PERMUTE  
 -  48: [    64,  6401,    16]             CONT  
 -  49: [    64,     1,    16]             VIEW  
 -  50: [    64,  6400,    16]             VIEW  
 -  51: [    64,  6400,    16]             CONT  
 -  52: [    64,  6400,    16]              MUL  
 -  53: [     2,    32,  6400]          RESHAPE  
 -  54: [    16,    32,  6400]          PERMUTE  
 -  55: [    16,    32,  6400]             CONT  
 -  56: [    16,    32,  6400]             VIEW  
 -  57: [    16,    32,  6400]            SCALE  
 -  58: [    16,    32,  6400]             VIEW  
 -  59: [    16,    32,  6400]           CONCAT  
 -  60: [     2,    32,  6400]          PERMUTE  
 -  61: [     2,    32,  6400]             CONT  
 -  62: [    64,  6400,    16]          RESHAPE  
 -  63: [    64,  6400,    16]              MUL  
 -  64: [    64,  6400,    16]              ADD  
 -  65: [    64,  6401,    16]           CONCAT  
 -  66: [    64,  6401,    16]            SCALE  
 -  67: [  6401,  6401,    16]          MUL_MAT  
 -  68: [  6401,  6401,    16]         SOFT_MAX  
 -  69: [    64,  6401,    16]          MUL_MAT  
 -  70: [    64,    16,  6401]          PERMUTE  
 -  71: [    64,    16,  6401]             CONT  
 -  72: [  1024,  6401,     1]          RESHAPE  
 -  73: [  1024,  6401,     1]             NORM  
 -  74: [  1024,  6401,     1]              MUL  
 -  75: [  1024,  6401,     1]              ADD  
 -  76: [  1024,  6401,     1]          MUL_MAT  
 -  77: [  1024,  6401,     1]              ADD  
 -  78: [  1024,  6401,     1]              ADD  
 -  79: [  1024,  6401,     1]             NORM  
 -  80: [  1024,  6401,     1]              MUL  
 -  81: [  1024,  6401,     1]              ADD  
 -  82: [  2730,  6401,     1]          MUL_MAT  
 -  83: [  2730,  6401,     1]              ADD  
 -  84: [  2730,  6401,     1]            UNARY  
 -  85: [  2730,  6401,     1]          MUL_MAT  
 -  86: [  2730,  6401,     1]              ADD  
 -  87: [  2730,  6401,     1]              MUL  
 -  88: [  2730,  6401,     1]             NORM  
 -  89: [  2730,  6401,     1]              MUL  
 -  90: [  2730,  6401,     1]              ADD  
 -  91: [  1024,  6401,     1]          MUL_MAT  
 -  92: [  1024,  6401,     1]              ADD  
 -  93: [  1024,  6401,     1]              ADD  
 -  94: [  1024,  6401,     1]            SCALE  
 -  95: [  1024,  6401,     1]            SCALE  
 -  96: [  1024,  6400,     1]             VIEW  
 -  97: [  1024,  6400,     1]              ADD  
n_leafs = 29
 -   0: [  1024,     1]     NONE vit.model.cls_token
 -   1: [    14,    14]     NONE vit.model.patch_embed.proj.weight
 -   2: [  1120,  1120]     NONE           leaf_2
 -   3: [  1024,     1]     NONE vit.model.patch_embed.proj.bias
 -   4: [  1024,  6401]     NONE vit.model.pos_embed
 -   5: [  1024,  1024]     NONE vit.model.blocks.0.attn.proj.weight
 -   6: [  1024,  1024]     NONE vit.model.blocks.0.attn.v_proj.weight
 -   7: [  1024,     1]     NONE vit.model.blocks.0.norm1.weight
 -   8: [  1024,     1]     NONE vit.model.blocks.0.norm1.bias
 -   9: [  1024,     1]     NONE vit.model.blocks.0.attn.v_bias
 -  10: [  1024,  1024]     NONE vit.model.blocks.0.attn.k_proj.weight
 -  11: [    64,  6400]     NONE vit.model.rope.freqs_cos
 -  12: [    64,  6400]     NONE vit.model.rope.freqs_sin
 -  13: [  1024,  1024]     NONE vit.model.blocks.0.attn.q_proj.weight
 -  14: [  1024,     1]     NONE vit.model.blocks.0.attn.q_bias
 -  15: [  1024,     1]     NONE vit.model.blocks.0.attn.inner_attn_ln.weight
 -  16: [  1024,     1]     NONE vit.model.blocks.0.attn.inner_attn_ln.bias
 -  17: [  1024,     1]     NONE vit.model.blocks.0.attn.proj.bias
 -  18: [  2730,  1024]     NONE vit.model.blocks.0.mlp.w3.weight
 -  19: [  1024,  2730]     NONE vit.model.blocks.0.mlp.w1.weight
 -  20: [  1024,     1]     NONE vit.model.blocks.0.norm2.weight
 -  21: [  1024,     1]     NONE vit.model.blocks.0.norm2.bias
 -  22: [  2730,     1]     NONE vit.model.blocks.0.mlp.w1.bias
 -  23: [  1024,  2730]     NONE vit.model.blocks.0.mlp.w2.weight
 -  24: [  2730,     1]     NONE vit.model.blocks.0.mlp.w2.bias
 -  25: [  2730,     1]     NONE vit.model.blocks.0.mlp.ffn_ln.weight
 -  26: [  2730,     1]     NONE vit.model.blocks.0.mlp.ffn_ln.bias
 -  27: [  1024,     1]     NONE vit.model.blocks.0.mlp.w3.bias
 -  28: [  1024,  6400]     NONE        pos_embed
========================================
Number of nodes is 98

Here's the main function of my GGML program, if it helps to identify the issues.

int main() {
    struct model_ctx cv_model;
    // Initialize the model contexts without backend buffer
    cv_model.backend = ggml_backend_cpu_init();
    ggml_backend_cpu_set_n_threads(cv_model.backend, 16);
    struct ggml_init_params weight_params = {
        // Counted 515 tensors in cross vision encoder save file
        10000000,  // Memory size
        NULL,  // Memory buffer
        true,  // Don't allocate tensor data
    };

    struct ggml_init_params compute_params = {
        GGML_DEFAULT_GRAPH_SIZE*ggml_tensor_overhead() + ggml_graph_overhead(),
        NULL,
        true,
    };

    cv_model.ctx_weight = ggml_init(weight_params);
    cv_model.ctx_compute = ggml_init(compute_params);

    const char * gguf_filename = "";
    bool rc = load_model(cv_model, gguf_filename);
    if (!rc) {return 1;}
    printf("Model loading succeeded\n");

    if (!get_input(cv_model)) {
        printf("Loading of image tensor failed\n");
        return 1;
    }
    printf("Image tensor had been loaded\n");
    
    struct ggml_cgraph * gf = cv_graph(cv_model, cv_model.model.input_image);
    ggml_graph_print(gf);
    printf("Number of nodes is %d\n", ggml_graph_n_nodes(gf));

    cv_model.compute_data = ggml_backend_alloc_ctx_tensors(cv_model.ctx_compute, cv_model.backend);

    // Not sure if I'm using the graph allocator correctly here
    ggml_gallocr_t allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(cv_model.backend));
    ggml_gallocr_reserve(allocr, gf);
    size_t compute_size = ggml_gallocr_get_buffer_size(allocr, 0);
    printf("Allocated %ld bytes of space for graph computation.\n", compute_size);  // Prints "32"
    ggml_gallocr_alloc_graph(allocr, gf);

    ggml_backend_graph_compute(cv_model.backend, gf);

    printf("Calculated a tensor with shape %ld %ld %ld %ld\n",
            cv_model.model.output_tensor->ne[0],
            cv_model.model.output_tensor->ne[1],
            cv_model.model.output_tensor->ne[2],
            cv_model.model.output_tensor->ne[3]);
    save_tensor(cv_model.model.output_tensor);

    ggml_backend_buffer_free(cv_model.weight_data);
    ggml_free(cv_model.ctx_weight);
    ggml_free(cv_model.ctx_compute);
    return 0;
}

As you can see, I am still a rookie at understanding and using GGML.
This is really an awesome library. I've had a lot of fun learning it, and I really appreciate the helpful examples and discussions from the contributor community.
Thank you!
