
metal : fix memory leak #2762


Merged: 7 commits merged into master on Aug 28, 2023

Conversation

ggerganov (Member)

close #2761

Should fix the memory increase observed when using Metal

cc @li-plus

li-plus (Contributor) commented Aug 24, 2023

Thanks for the quick response! Just tried this version. It seems there is still a leak, but a much slower one (around 2 MB/min). Do you observe that as well?

ggerganov (Member, Author)

Yes, but I'm not sure what is causing it

jxy (Contributor) commented Aug 24, 2023

[encoder release]?

jxy (Contributor) commented Aug 24, 2023

Try this:

diff --git a/ggml-metal.m b/ggml-metal.m
index d385340..5e17d72 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -522,12 +522,15 @@ void ggml_metal_graph_compute(
     const int n_cb = ctx->n_cb;
 
     NSMutableArray * command_buffers = [NSMutableArray arrayWithCapacity:n_cb];
+    NSMutableArray * encoders = [NSMutableArray arrayWithCapacity:n_cb];
 
     for (int i = 0; i < n_cb; ++i) {
         command_buffers[i] = [ctx->queue commandBuffer];
 
         // enqueue the command buffers in order to specify their execution order
         [command_buffers[i] enqueue];
+
+        encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
     // TODO: is this the best way to start threads?
@@ -543,7 +546,7 @@ void ggml_metal_graph_compute(
 
             id<MTLCommandBuffer> command_buffer = command_buffers[cb_idx];
 
-            id<MTLComputeCommandEncoder> encoder = [command_buffer computeCommandEncoderWithDescriptor: edesc];
+            id<MTLComputeCommandEncoder> encoder = encoders[cb_idx];
 
             const int node_start =                                      (cb_idx + 0) * n_nodes_per_cb;
             const int node_end   = MIN((cb_idx == n_cb - 1) ? n_nodes : (cb_idx + 1) * n_nodes_per_cb, n_nodes);
@@ -1109,7 +1112,6 @@ void ggml_metal_graph_compute(
 
             if (encoder != nil) {
                 [encoder endEncoding];
-                encoder = nil;
             }
 
             [command_buffer commit];
@@ -1122,7 +1124,7 @@ void ggml_metal_graph_compute(
     [command_buffers[n_cb - 1] waitUntilCompleted];
 
     // release resources
-    [queue release];
+    dispatch_release(queue);
 
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
@@ -1133,8 +1135,10 @@ void ggml_metal_graph_compute(
             GGML_ASSERT(false);
         }
 
+        [encoders[i] release];
         [command_buffers[i] release];
     }
 
+    [encoders release];
     [command_buffers release];
 }

I believe you need dispatch_release(queue) instead of [queue release]. For the rest of the code, I just moved the encoders out and released them the same way as the command buffers.

With this, for 13B q6_K, it stays at around 26 MB.

ggerganov (Member, Author)

Thanks @jxy! This seems to work.

[queue release]; seems to work, while with dispatch_release(queue); the memory usage seemed to increase very slowly (but I'm not 100% sure).

Which option should we choose?

jxy (Contributor) commented Aug 24, 2023

Usually dispatch_release is recommended because the dispatch queue might still be busy. But in this case I guess there's not much difference.
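
For illustration, here is a minimal, self-contained sketch (not the llama.cpp code) of the create / barrier / release pattern being discussed; the queue label and file name are made up. It compiles as plain C with Apple clang, where blocks are enabled by default:

#include <dispatch/dispatch.h>
#include <stdio.h>

// gcd_demo.c — build with: clang gcd_demo.c -o gcd_demo   (non-ARC, plain C)
int main(void) {
    // concurrent queue, analogous to the one created in ggml_metal_graph_compute
    dispatch_queue_t queue = dispatch_queue_create("example.concurrent", DISPATCH_QUEUE_CONCURRENT);

    for (int i = 0; i < 4; ++i) {
        dispatch_async(queue, ^{
            printf("block %d done\n", i);   // blocks capture i by value
        });
    }

    // an empty barrier block runs only after all previously submitted blocks
    // have finished, so this is a synchronization point for the whole queue
    dispatch_barrier_sync(queue, ^{});

    // outside of ARC, GCD objects are released with dispatch_release;
    // under ARC this call is unavailable and the queue is released automatically
    dispatch_release(queue);
    return 0;
}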

I just ran it through Xcode Instruments; there are two sets of things left.

diff --git a/ggml-metal.m b/ggml-metal.m
index 45c3a1f..27383cc 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -233,9 +233,62 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
 
 void ggml_metal_free(struct ggml_metal_context * ctx) {
     fprintf(stderr, "%s: deallocating\n", __func__);
+#define GGML_METAL_DEL_KERNEL(name) \
+        [ctx->function_##name release]; \
+        [ctx->pipeline_##name release];
+
+        GGML_METAL_DEL_KERNEL(add);
+        GGML_METAL_DEL_KERNEL(add_row);
+        GGML_METAL_DEL_KERNEL(mul);
+        GGML_METAL_DEL_KERNEL(mul_row);
+        GGML_METAL_DEL_KERNEL(scale);
+        GGML_METAL_DEL_KERNEL(silu);
+        GGML_METAL_DEL_KERNEL(relu);
+        GGML_METAL_DEL_KERNEL(gelu);
+        GGML_METAL_DEL_KERNEL(soft_max);
+        GGML_METAL_DEL_KERNEL(diag_mask_inf);
+        GGML_METAL_DEL_KERNEL(get_rows_f16);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_0);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_1);
+//        GGML_METAL_DEL_KERNEL(get_rows_q8_0);
+        GGML_METAL_DEL_KERNEL(get_rows_q2_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q3_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q5_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q6_K);
+        GGML_METAL_DEL_KERNEL(rms_norm);
+        GGML_METAL_DEL_KERNEL(norm);
+        GGML_METAL_DEL_KERNEL(mul_mat_f16_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_1_f32);
+//        GGML_METAL_DEL_KERNEL(mul_mat_q8_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q2_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q3_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q5_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q6_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_f16_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_0_f32);
+//        GGML_METAL_DEL_KERNEL(mul_mm_q8_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_1_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q2_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q3_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q5_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q6_K_f32);
+        GGML_METAL_DEL_KERNEL(rope);
+        GGML_METAL_DEL_KERNEL(alibi_f32);
+        GGML_METAL_DEL_KERNEL(cpy_f32_f16);
+        GGML_METAL_DEL_KERNEL(cpy_f32_f32);
+        GGML_METAL_DEL_KERNEL(cpy_f16_f16);
+
+#undef GGML_METAL_DEL_KERNEL
     for (int i = 0; i < ctx->n_buffers; ++i) {
         [ctx->buffers[i].metal release];
     }
+    [ctx->library release];
+    [ctx->queue release];
+    [ctx->device release];
     free(ctx);
 }
 
@@ -527,6 +580,8 @@ void ggml_metal_graph_compute(
         command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
+    [edesc release];
+
     // TODO: is this the best way to start threads?
     dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
1. In ggml_metal_graph_compute we also need to release the descriptor.
2. A complete cleanup in ggml_metal_free.

ggerganov force-pushed the metal-fix-memory-leak branch from 53dea11 to e778b10 on August 25, 2023 07:16
ggerganov (Member, Author)

There is still a very small memory leak somewhere - not sure it is in the Obj-C code though

jxy (Contributor) commented Aug 25, 2023

Right, let's wait for all the command buffers.

diff --git a/ggml-metal.m b/ggml-metal.m
index eae3074..16ae5a0 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1177,15 +1177,10 @@ void ggml_metal_graph_compute(
     // wait for all threads to finish
     dispatch_barrier_sync(queue, ^{});
 
-    [command_buffers[n_cb - 1] waitUntilCompleted];
-
-    // release resources
-    [edesc release];
-    [queue release];
-
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
     for (int i = 0; i < n_cb; i++) {
+        [command_buffers[i] waitUntilCompleted];
         MTLCommandBufferStatus status = (MTLCommandBufferStatus) [command_buffers[i] status];
         if (status != MTLCommandBufferStatusCompleted) {
             fprintf(stderr, "%s: command buffer %d failed with status %lu\n", __func__, i, status);
@@ -1196,6 +1191,10 @@ void ggml_metal_graph_compute(
         [command_buffers[i] release];
     }
 
+    // release resources
+    [edesc release];
+    [queue release];
+
     [command_encoders release];
     [command_buffers release];
 }

After this change, generating a few tokens leaves the following allocations:
[Screenshot: Instruments allocations, 2023-08-25 9:32 AM]
dispatch_async uses heap allocation, and GCD only cleans things up periodically. Typically these allocations get freed within a minute or two.
[Screenshot: Instruments allocations, 2023-08-25 9:58 AM]

ggerganov (Member, Author)

Let me know if you observe any more leaks - I haven't tested long runs yet

li-plus (Contributor) commented Aug 26, 2023

> Let me know if you observe any more leaks - I haven't tested long runs yet

It's still leaking. I ran it for 9 hours and memory increased by 800 MB, using commit 9e2ec8e.

ggerganov (Member, Author)

Any ideas? How do we fix this?

slaren (Member) commented Aug 26, 2023

This may be obvious, but leak sanitizer may help if you haven't tried that already (-fsanitize=leak, should also be included with -fsanitize=address).
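
For reference, a minimal sketch of the kind of defect the leak sanitizer reports; this is unrelated to the llama.cpp sources and the file name is made up. On platforms where LeakSanitizer is available (e.g. Linux clang/gcc), the unfreed allocation below is reported as a direct leak at process exit:

#include <stdlib.h>
#include <string.h>

// leak_demo.c — build with: clang -g -fsanitize=address leak_demo.c -o leak_demo
//               (or -fsanitize=leak where standalone LSan is supported)
int main(void) {
    char * buf = malloc(64);
    strcpy(buf, "never freed");   // allocated but never freed
    return 0;                     // LeakSanitizer reports the 64-byte direct leak here
}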

ggerganov force-pushed the metal-fix-memory-leak branch from 9e2ec8e to de94ca3 on August 27, 2023 07:12
ggerganov (Member, Author) commented Aug 27, 2023

-fsanitize=address does not show anything, and -fsanitize=leak is not supported by macOS clang. On x64 it does not show anything either.

Tried running valgrind with the 7B model - it's hopeless, very very slow.

I think we need to use a tiny 60M LLaMA to be able to generate large amounts of tokens without waiting for hours. I suspect the leak might not be related to Metal - C++ container allocation/deallocation could be fragmenting the memory in some way instead. Therefore we should first make sure that very long CPU (and probably CUDA) generations do not exhibit the same behaviour. But to do that in a meaningful way, we need a much smaller model to test with.

Edit: nvm, it seems to definitely be something related to Metal. Can't figure out the cause.

ggerganov (Member, Author)

Running with 1 CPU thread (i.e. one command buffer) seems to reduce the speed of the leak, which makes me think it is something in the dispatch_async block in ggml-metal.m.

Could it be that MTLSizeMake is allocating on the heap? I don't see a way to free the object though, so probably not.
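
For context, a minimal sketch of how MTLSize is used - it is a plain C struct returned by value from MTLSizeMake, so the call itself does not heap-allocate; the function name below is hypothetical:

#import <Metal/Metal.h>

static void example_dispatch(id<MTLComputeCommandEncoder> encoder) {
    MTLSize grid_size        = MTLSizeMake(64, 1, 1);  // stack value, nothing to free
    MTLSize threadgroup_size = MTLSizeMake(32, 1, 1);
    [encoder dispatchThreadgroups:grid_size threadsPerThreadgroup:threadgroup_size];
}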

ggerganov (Member, Author)

In whisper.cpp, a proposal to use @autoreleasepool has been made and it seems to resolve a similar memory leak:

ggml-org/whisper.cpp#1202 (comment)

We should try this here too
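
For illustration, a rough Objective-C sketch of the idea (not the actual whisper.cpp or llama.cpp change; the function name is hypothetical): wrapping the per-graph Metal work in @autoreleasepool drains any autoreleased objects created during encoding on every call, instead of letting them accumulate.

#import <Metal/Metal.h>

static void example_graph_compute(id<MTLCommandQueue> queue) {
    @autoreleasepool {
        // objects returned by these factory calls are autoreleased
        id<MTLCommandBuffer>         cmd_buf = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [cmd_buf computeCommandEncoder];

        // ... set pipeline state, set buffers, dispatch threadgroups ...

        [encoder endEncoding];
        [cmd_buf commit];
        [cmd_buf waitUntilCompleted];
    } // the pool drains here, releasing the autoreleased Metal objects
}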

jhen0409 (Collaborator)

> In whisper.cpp, a proposal to use @autoreleasepool has been made and it seems to resolve a similar memory leak:
>
> ggerganov/whisper.cpp#1202 (comment)
>
> We should try this here too

It still leaks when wrapping ggml_metal_graph_compute in @autoreleasepool {}, but I believe @li-plus has already tried that, as mentioned in the issue.

After also trying to reuse the dispatch_queue_t, it seems to work for me:

diff --git a/ggml-metal.m b/ggml-metal.m
index e825b63..60141f2 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -39,6 +39,8 @@
     id<MTLCommandQueue> queue;
     id<MTLLibrary>      library;
 
+    dispatch_queue_t d_queue;
+
     int n_buffers;
     struct ggml_metal_buffer buffers[GGML_METAL_MAX_BUFFERS];
 
@@ -120,6 +122,7 @@ @implementation GGMLMetalClass
     ctx->n_buffers = 0;
     ctx->concur_list_len = 0;
 
+    ctx->d_queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
 #if 0
     // compile from source string and show compile log
@@ -297,6 +300,7 @@ void ggml_metal_free(struct ggml_metal_context * ctx) {
     [ctx->library release];
     [ctx->queue release];
     [ctx->device release];
+    dispatch_release(ctx->d_queue);
 
     free(ctx);
 }
@@ -563,6 +567,8 @@ void ggml_metal_graph_compute(
                struct ggml_cgraph * gf) {
     metal_printf("%s: evaluating graph\n", __func__);
 
+    @autoreleasepool {
+
     // if there is ctx->concur_list, dispatch concurrently
     // else fallback to serial dispatch
     MTLComputePassDescriptor * edesc = MTLComputePassDescriptor.computePassDescriptor;
@@ -589,13 +595,11 @@ void ggml_metal_graph_compute(
         command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
-    // TODO: is this the best way to start threads?
-    dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
     for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
         const int n_nodes_per_cb = (n_nodes + n_cb - 1) / n_cb;
 
-        dispatch_async(queue, ^{
+        dispatch_async(ctx->d_queue, ^{
             size_t offs_src0 = 0;
             size_t offs_src1 = 0;
             size_t offs_dst  = 0;
@@ -1175,7 +1179,7 @@ void ggml_metal_graph_compute(
     }
 
     // wait for all threads to finish
-    dispatch_barrier_sync(queue, ^{});
+    dispatch_barrier_sync(ctx->d_queue, ^{});
 
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
@@ -1188,14 +1192,7 @@ void ggml_metal_graph_compute(
             GGML_ASSERT(false);
         }
 
-        [command_encoders[i] release];
-        [command_buffers[i] release];
     }
 
-    // release resources
-    [edesc release];
-    [queue release];
-
-    [command_encoders release];
-    [command_buffers release];
+    }
 }

But I'm not yet clear on the main cause.

ggerganov force-pushed the metal-fix-memory-leak branch from de94ca3 to 43a8a62 on August 28, 2023 06:58
ggerganov (Member, Author) commented Aug 28, 2023

I think the leak speed is proportional to the number of threads -t.
With -t 1 there is almost no leak, and with -t 16 on the M2 Studio it's noticeable.

Note that when using Metal, the threads play no role in the computation - they are just used to create that many command buffers in parallel, which in practice does not offer any benefit, but the Apple docs say it is good practice to do so.

ggerganov (Member, Author)

> After also trying to reuse the dispatch_queue_t, it seems to work for me:

You are right - after a few minutes of generation it settles to a constant memory usage, so there is no longer a leak.

I've made another small change to reuse the arrays for the command buffers and encoders, which is probably not necessary, but I think it is better like this.

I've also set up the 15M tinyllama from the llama2.c project and tested high-speed generation for several minutes. At 1100 t/s CPU-only and 700 t/s with Metal, there is no memory usage increase observed in top after 10 minutes of generation in either case, so I'm fairly confident that we no longer leak with CPU or Metal.
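
As a rough sketch of the allocate-once / reuse-per-call pattern described above (the struct and field names here are hypothetical, not the actual ggml-metal.m layout):

#import <Metal/Metal.h>

#define EXAMPLE_MAX_COMMAND_BUFFERS 16

// hypothetical context: the dispatch queue and the arrays live for the whole
// lifetime of the context instead of being created on every graph evaluation
struct example_metal_context {
    id<MTLCommandQueue>          queue;
    dispatch_queue_t             d_queue;   // created once in init, released in free
    id<MTLCommandBuffer>         cmd_bufs[EXAMPLE_MAX_COMMAND_BUFFERS];
    id<MTLComputeCommandEncoder> encoders[EXAMPLE_MAX_COMMAND_BUFFERS];
    int n_cb;
};

static void example_graph_compute(struct example_metal_context * ctx) {
    @autoreleasepool {
        // the slots are refilled on every call; nothing new is allocated per call
        // beyond the autoreleased Metal objects, which the pool drains on exit
        for (int i = 0; i < ctx->n_cb; ++i) {
            ctx->cmd_bufs[i] = [ctx->queue commandBuffer];
            ctx->encoders[i] = [ctx->cmd_bufs[i] computeCommandEncoder];
            [ctx->cmd_bufs[i] enqueue];
        }

        // ... dispatch_async the encoding work on ctx->d_queue; each block ends
        // encoding and commits its command buffer ...

        dispatch_barrier_sync(ctx->d_queue, ^{});

        for (int i = 0; i < ctx->n_cb; ++i) {
            [ctx->cmd_bufs[i] waitUntilCompleted];
        }
    }
}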

ggerganov merged commit f55538c into master on Aug 28, 2023
ggerganov (Member, Author)

@jxy @jhen0409 @li-plus Thank you for the help!

ggerganov deleted the metal-fix-memory-leak branch on August 28, 2023 07:59
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* metal : fix memory leak

* metal : fix encoders memory leak

* metal : clean up more memory resources

* metal : fix more leaks

* metal : reuse dispatch queue + autoreleasepool

* metal : reuse array for command buffers and encoders

* ggml : assert for odd number of blocks on ARM

15M tinyllama is an example
Linked issue closed by this PR: Rapid memory leak (2MB/s) using Metal backend (#2761)