
metal : fix memory leak #2762


Merged: 7 commits merged into master on Aug 28, 2023

Conversation

ggerganov (Member)

close #2761

Should fix the memory increase observed when using Metal

cc @li-plus

li-plus (Contributor) commented Aug 24, 2023

Thanks for the quick response! Just tried this version. It seems there is still a leak, but a much slower one (around 2 MB/min). Do you observe that as well?

ggerganov (Member, Author)

Yes, but I'm not sure what is causing it

jxy (Contributor) commented Aug 24, 2023

[encoder release]?

jxy (Contributor) commented Aug 24, 2023

Try this:

diff --git a/ggml-metal.m b/ggml-metal.m
index d385340..5e17d72 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -522,12 +522,15 @@ void ggml_metal_graph_compute(
     const int n_cb = ctx->n_cb;
 
     NSMutableArray * command_buffers = [NSMutableArray arrayWithCapacity:n_cb];
+    NSMutableArray * encoders = [NSMutableArray arrayWithCapacity:n_cb];
 
     for (int i = 0; i < n_cb; ++i) {
         command_buffers[i] = [ctx->queue commandBuffer];
 
         // enqueue the command buffers in order to specify their execution order
         [command_buffers[i] enqueue];
+
+        encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
     // TODO: is this the best way to start threads?
@@ -543,7 +546,7 @@ void ggml_metal_graph_compute(
 
             id<MTLCommandBuffer> command_buffer = command_buffers[cb_idx];
 
-            id<MTLComputeCommandEncoder> encoder = [command_buffer computeCommandEncoderWithDescriptor: edesc];
+            id<MTLComputeCommandEncoder> encoder = encoders[cb_idx];
 
             const int node_start =                                      (cb_idx + 0) * n_nodes_per_cb;
             const int node_end   = MIN((cb_idx == n_cb - 1) ? n_nodes : (cb_idx + 1) * n_nodes_per_cb, n_nodes);
@@ -1109,7 +1112,6 @@ void ggml_metal_graph_compute(
 
             if (encoder != nil) {
                 [encoder endEncoding];
-                encoder = nil;
             }
 
             [command_buffer commit];
@@ -1122,7 +1124,7 @@ void ggml_metal_graph_compute(
     [command_buffers[n_cb - 1] waitUntilCompleted];
 
     // release resources
-    [queue release];
+    dispatch_release(queue);
 
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
@@ -1133,8 +1135,10 @@ void ggml_metal_graph_compute(
             GGML_ASSERT(false);
         }
 
+        [encoders[i] release];
         [command_buffers[i] release];
     }
 
+    [encoders release];
     [command_buffers release];
 }

I believe you need dispatch_release(queue) instead of [queue release]. For the rest of the code, I just moved the encoders out and released them the same way as the command buffers.

With this, for 13B q6_K, it stays at around 26 MB.

ggerganov (Member, Author)

Thanks @jxy! This seems to work.

[queue release]; seems to work, while with dispatch_release(queue); the memory usage seemed to increase very slowly (but I'm not 100% sure).

Which option should we choose?

jxy (Contributor) commented Aug 24, 2023

Usually dispatch_release is recommended because the dispatch queue might still be busy. But in this case I guess there's not much difference.
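
For illustration, here is a minimal, self-contained sketch (not the llama.cpp code) of the create / barrier / release pattern being discussed; the queue label and file name are made up. It compiles as plain C with Apple clang, where blocks are enabled by default:

#include <dispatch/dispatch.h>
#include <stdio.h>

// gcd_demo.c — build with: clang gcd_demo.c -o gcd_demo   (non-ARC, plain C)
int main(void) {
    // concurrent queue, analogous to the one created in ggml_metal_graph_compute
    dispatch_queue_t queue = dispatch_queue_create("example.concurrent", DISPATCH_QUEUE_CONCURRENT);

    for (int i = 0; i < 4; ++i) {
        dispatch_async(queue, ^{
            printf("block %d done\n", i);   // blocks capture i by value
        });
    }

    // an empty barrier block runs only after all previously submitted blocks
    // have finished, so this is a synchronization point for the whole queue
    dispatch_barrier_sync(queue, ^{});

    // outside of ARC, GCD objects are released with dispatch_release;
    // under ARC this call is unavailable and the queue is released automatically
    dispatch_release(queue);
    return 0;
}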

I just ran it through Xcode Instruments; there are two sets of things left.

diff --git a/ggml-metal.m b/ggml-metal.m
index 45c3a1f..27383cc 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -233,9 +233,62 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
 
 void ggml_metal_free(struct ggml_metal_context * ctx) {
     fprintf(stderr, "%s: deallocating\n", __func__);
+#define GGML_METAL_DEL_KERNEL(name) \
+        [ctx->function_##name release]; \
+        [ctx->pipeline_##name release];
+
+        GGML_METAL_DEL_KERNEL(add);
+        GGML_METAL_DEL_KERNEL(add_row);
+        GGML_METAL_DEL_KERNEL(mul);
+        GGML_METAL_DEL_KERNEL(mul_row);
+        GGML_METAL_DEL_KERNEL(scale);
+        GGML_METAL_DEL_KERNEL(silu);
+        GGML_METAL_DEL_KERNEL(relu);
+        GGML_METAL_DEL_KERNEL(gelu);
+        GGML_METAL_DEL_KERNEL(soft_max);
+        GGML_METAL_DEL_KERNEL(diag_mask_inf);
+        GGML_METAL_DEL_KERNEL(get_rows_f16);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_0);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_1);
+//        GGML_METAL_DEL_KERNEL(get_rows_q8_0);
+        GGML_METAL_DEL_KERNEL(get_rows_q2_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q3_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q4_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q5_K);
+        GGML_METAL_DEL_KERNEL(get_rows_q6_K);
+        GGML_METAL_DEL_KERNEL(rms_norm);
+        GGML_METAL_DEL_KERNEL(norm);
+        GGML_METAL_DEL_KERNEL(mul_mat_f16_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_1_f32);
+//        GGML_METAL_DEL_KERNEL(mul_mat_q8_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q2_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q3_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q4_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q5_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mat_q6_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_f16_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_0_f32);
+//        GGML_METAL_DEL_KERNEL(mul_mm_q8_0_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_1_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q2_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q3_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q4_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q5_K_f32);
+        GGML_METAL_DEL_KERNEL(mul_mm_q6_K_f32);
+        GGML_METAL_DEL_KERNEL(rope);
+        GGML_METAL_DEL_KERNEL(alibi_f32);
+        GGML_METAL_DEL_KERNEL(cpy_f32_f16);
+        GGML_METAL_DEL_KERNEL(cpy_f32_f32);
+        GGML_METAL_DEL_KERNEL(cpy_f16_f16);
+
+#undef GGML_METAL_DEL_KERNEL
     for (int i = 0; i < ctx->n_buffers; ++i) {
         [ctx->buffers[i].metal release];
     }
+    [ctx->library release];
+    [ctx->queue release];
+    [ctx->device release];
     free(ctx);
 }
 
@@ -527,6 +580,8 @@ void ggml_metal_graph_compute(
         command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
+    [edesc release];
+
     // TODO: is this the best way to start threads?
     dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
1. In ggml_metal_graph_compute we also need to release the descriptor.
2. A complete cleanup in ggml_metal_free.

ggerganov force-pushed the metal-fix-memory-leak branch from 53dea11 to e778b10 on August 25, 2023 07:16
ggerganov (Member, Author)

There is still a very small memory leak somewhere - not sure it is in the Obj-C code though

jxy (Contributor) commented Aug 25, 2023

Right, let's wait for all the command buffers.

diff --git a/ggml-metal.m b/ggml-metal.m
index eae3074..16ae5a0 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1177,15 +1177,10 @@ void ggml_metal_graph_compute(
     // wait for all threads to finish
     dispatch_barrier_sync(queue, ^{});
 
-    [command_buffers[n_cb - 1] waitUntilCompleted];
-
-    // release resources
-    [edesc release];
-    [queue release];
-
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
     for (int i = 0; i < n_cb; i++) {
+        [command_buffers[i] waitUntilCompleted];
         MTLCommandBufferStatus status = (MTLCommandBufferStatus) [command_buffers[i] status];
         if (status != MTLCommandBufferStatusCompleted) {
             fprintf(stderr, "%s: command buffer %d failed with status %lu\n", __func__, i, status);
@@ -1196,6 +1191,10 @@ void ggml_metal_graph_compute(
         [command_buffers[i] release];
     }
 
+    // release resources
+    [edesc release];
+    [queue release];
+
     [command_encoders release];
     [command_buffers release];
 }

After this change, generating a few tokens leaves the following allocations:
[Screenshot: Instruments allocations, 2023-08-25 9:32 AM]
dispatch_async uses heap allocation, and GCD only cleans things up periodically. Typically these allocations get freed within a minute or two.
[Screenshot: Instruments allocations, 2023-08-25 9:58 AM]

ggerganov (Member, Author)

Let me know if you observe any more leaks - I haven't tested long runs yet

li-plus (Contributor) commented Aug 26, 2023

> Let me know if you observe any more leaks - I haven't tested long runs yet

It's still leaking. I ran it for 9 hours and memory increased by 800 MB, using commit 9e2ec8e.

ggerganov (Member, Author)

Any ideas? How do we fix this?

slaren (Member) commented Aug 26, 2023

This may be obvious, but leak sanitizer may help if you haven't tried that already (-fsanitize=leak, should also be included with -fsanitize=address).
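
For reference, a minimal sketch of the kind of defect the leak sanitizer reports; this is unrelated to the llama.cpp sources and the file name is made up. On platforms where LeakSanitizer is available (e.g. Linux clang/gcc), the unfreed allocation below is reported as a direct leak at process exit:

#include <stdlib.h>
#include <string.h>

// leak_demo.c — build with: clang -g -fsanitize=address leak_demo.c -o leak_demo
//               (or -fsanitize=leak where standalone LSan is supported)
int main(void) {
    char * buf = malloc(64);
    strcpy(buf, "never freed");   // allocated but never freed
    return 0;                     // LeakSanitizer reports the 64-byte direct leak here
}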

ggerganov force-pushed the metal-fix-memory-leak branch from 9e2ec8e to de94ca3 on August 27, 2023 07:12
ggerganov (Member, Author) commented Aug 27, 2023

-fsanitize=address does not show anything, and -fsanitize=leak is not supported by macOS clang. On x64 it does not show anything either.

Tried running valgrind with the 7B model - it's hopeless, very very slow.

I think we need to use a tiny 60M LLaMA to be able to generate large amounts of tokens without waiting for hours. I suspect the leak might not be related to Metal - C++ container allocation/deallocation could be fragmenting the memory in some way instead. Therefore we should first make sure that very long CPU (and probably CUDA) generations do not exhibit the same behaviour. But to do that in a meaningful way, we need a much smaller model to test with.

Edit: nvm, it seems to definitely be something related to Metal. Can't figure out the cause.

ggerganov (Member, Author)

Running with 1 CPU thread (i.e. one command buffer) seems to reduce the speed of the leak, which makes me think it is something in the dispatch_async block in ggml-metal.m.

Could it be that MTLSizeMake is allocating on the heap? I don't see a way to free the object though, so probably not.
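
For context, a minimal sketch of how MTLSize is used - it is a plain C struct returned by value from MTLSizeMake, so the call itself does not heap-allocate; the function name below is hypothetical:

#import <Metal/Metal.h>

static void example_dispatch(id<MTLComputeCommandEncoder> encoder) {
    MTLSize grid_size        = MTLSizeMake(64, 1, 1);  // stack value, nothing to free
    MTLSize threadgroup_size = MTLSizeMake(32, 1, 1);
    [encoder dispatchThreadgroups:grid_size threadsPerThreadgroup:threadgroup_size];
}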

ggerganov (Member, Author)

In whisper.cpp, a proposal to use @autoreleasepool has been made and it seems to resolve a similar memory leak:

ggml-org/whisper.cpp#1202 (comment)

We should try this here too
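
For illustration, a rough Objective-C sketch of the idea (not the actual whisper.cpp or llama.cpp change; the function name is hypothetical): wrapping the per-graph Metal work in @autoreleasepool drains any autoreleased objects created during encoding on every call, instead of letting them accumulate.

#import <Metal/Metal.h>

static void example_graph_compute(id<MTLCommandQueue> queue) {
    @autoreleasepool {
        // objects returned by these factory calls are autoreleased
        id<MTLCommandBuffer>         cmd_buf = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [cmd_buf computeCommandEncoder];

        // ... set pipeline state, set buffers, dispatch threadgroups ...

        [encoder endEncoding];
        [cmd_buf commit];
        [cmd_buf waitUntilCompleted];
    } // the pool drains here, releasing the autoreleased Metal objects
}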

jhen0409 (Collaborator)

> In whisper.cpp, a proposal to use @autoreleasepool has been made and it seems to resolve a similar memory leak:
>
> ggerganov/whisper.cpp#1202 (comment)
>
> We should try this here too

It still leaks when wrapping ggml_metal_graph_compute in @autoreleasepool {}, but I believe @li-plus has already tried that, as mentioned in the issue.

After also trying to reuse the dispatch_queue_t, it seems to work for me:

diff --git a/ggml-metal.m b/ggml-metal.m
index e825b63..60141f2 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -39,6 +39,8 @@
     id<MTLCommandQueue> queue;
     id<MTLLibrary>      library;
 
+    dispatch_queue_t d_queue;
+
     int n_buffers;
     struct ggml_metal_buffer buffers[GGML_METAL_MAX_BUFFERS];
 
@@ -120,6 +122,7 @@ @implementation GGMLMetalClass
     ctx->n_buffers = 0;
     ctx->concur_list_len = 0;
 
+    ctx->d_queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
 #if 0
     // compile from source string and show compile log
@@ -297,6 +300,7 @@ void ggml_metal_free(struct ggml_metal_context * ctx) {
     [ctx->library release];
     [ctx->queue release];
     [ctx->device release];
+    dispatch_release(ctx->d_queue);
 
     free(ctx);
 }
@@ -563,6 +567,8 @@ void ggml_metal_graph_compute(
                struct ggml_cgraph * gf) {
     metal_printf("%s: evaluating graph\n", __func__);
 
+    @autoreleasepool {
+
     // if there is ctx->concur_list, dispatch concurrently
     // else fallback to serial dispatch
     MTLComputePassDescriptor * edesc = MTLComputePassDescriptor.computePassDescriptor;
@@ -589,13 +595,11 @@ void ggml_metal_graph_compute(
         command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
     }
 
-    // TODO: is this the best way to start threads?
-    dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
 
     for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
         const int n_nodes_per_cb = (n_nodes + n_cb - 1) / n_cb;
 
-        dispatch_async(queue, ^{
+        dispatch_async(ctx->d_queue, ^{
             size_t offs_src0 = 0;
             size_t offs_src1 = 0;
             size_t offs_dst  = 0;
@@ -1175,7 +1179,7 @@ void ggml_metal_graph_compute(
     }
 
     // wait for all threads to finish
-    dispatch_barrier_sync(queue, ^{});
+    dispatch_barrier_sync(ctx->d_queue, ^{});
 
     // check status of command buffers
     // needed to detect if the device ran out-of-memory for example (#1881)
@@ -1188,14 +1192,7 @@ void ggml_metal_graph_compute(
             GGML_ASSERT(false);
         }
 
-        [command_encoders[i] release];
-        [command_buffers[i] release];
     }
 
-    // release resources
-    [edesc release];
-    [queue release];
-
-    [command_encoders release];
-    [command_buffers release];
+    }
 }

But I'm not yet clear on the main cause.

ggerganov force-pushed the metal-fix-memory-leak branch from de94ca3 to 43a8a62 on August 28, 2023 06:58
ggerganov (Member, Author) commented Aug 28, 2023

I think the leak speed is proportional to the number of threads -t.
With -t 1 there is almost no leak, and with -t 16 on the M2 Studio it's noticeable.

Note that when using Metal, the threads play no role in the computation - they are just used to create that many command buffers in parallel, which in practice does not offer any benefit, but the Apple docs say it is good practice to do so.

ggerganov (Member, Author)

> After also trying to reuse the dispatch_queue_t, it seems to work for me:

You are right - after a few minutes of generation it settles to a constant memory usage, so there is no longer a leak.

I've made another small change to reuse the arrays for the command buffers and encoders, which is probably not necessary, but I think it is better like this.

I've also set up the 15M tinyllama from the llama2.c project and tested high-speed generation for several minutes. At 1100 t/s CPU-only and 700 t/s with Metal, there is no memory usage increase observed in top after 10 minutes of generation in either case, so I'm fairly confident that we no longer leak with CPU or Metal.
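
As a rough sketch of the allocate-once / reuse-per-call pattern described above (the struct and field names here are hypothetical, not the actual ggml-metal.m layout):

#import <Metal/Metal.h>

#define EXAMPLE_MAX_COMMAND_BUFFERS 16

// hypothetical context: the dispatch queue and the arrays live for the whole
// lifetime of the context instead of being created on every graph evaluation
struct example_metal_context {
    id<MTLCommandQueue>          queue;
    dispatch_queue_t             d_queue;   // created once in init, released in free
    id<MTLCommandBuffer>         cmd_bufs[EXAMPLE_MAX_COMMAND_BUFFERS];
    id<MTLComputeCommandEncoder> encoders[EXAMPLE_MAX_COMMAND_BUFFERS];
    int n_cb;
};

static void example_graph_compute(struct example_metal_context * ctx) {
    @autoreleasepool {
        // the slots are refilled on every call; nothing new is allocated per call
        // beyond the autoreleased Metal objects, which the pool drains on exit
        for (int i = 0; i < ctx->n_cb; ++i) {
            ctx->cmd_bufs[i] = [ctx->queue commandBuffer];
            ctx->encoders[i] = [ctx->cmd_bufs[i] computeCommandEncoder];
            [ctx->cmd_bufs[i] enqueue];
        }

        // ... dispatch_async the encoding work on ctx->d_queue; each block ends
        // encoding and commits its command buffer ...

        dispatch_barrier_sync(ctx->d_queue, ^{});

        for (int i = 0; i < ctx->n_cb; ++i) {
            [ctx->cmd_bufs[i] waitUntilCompleted];
        }
    }
}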

ggerganov merged commit f55538c into master on Aug 28, 2023
ggerganov (Member, Author)

@jxy @jhen0409 @li-plus Thank you for the help!

ggerganov deleted the metal-fix-memory-leak branch on August 28, 2023 07:59
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* metal : fix memory leak

* metal : fix encoders memory leak

* metal : clean up more memory resources

* metal : fix more leaks

* metal : reuse dispatch queue + autoreleasepool

* metal : reuse array for command buffers and encoders

* ggml : assert for odd number of blocks on ARM

15M tinyllama is an example
Linked issue closed by this PR: Rapid memory leak (2MB/s) using Metal backend (#2761)