metal : fix memory leak #2762
Conversation
Thanks for the quick response! Just tried this version. It seems there is still a leak, but a much slower one (around 2 MB/min). Can you observe that?

Yes, but I'm not sure what is causing it.
Try this:

diff --git a/ggml-metal.m b/ggml-metal.m
index d385340..5e17d72 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -522,12 +522,15 @@ void ggml_metal_graph_compute(
const int n_cb = ctx->n_cb;
NSMutableArray * command_buffers = [NSMutableArray arrayWithCapacity:n_cb];
+ NSMutableArray * encoders = [NSMutableArray arrayWithCapacity:n_cb];
for (int i = 0; i < n_cb; ++i) {
command_buffers[i] = [ctx->queue commandBuffer];
// enqueue the command buffers in order to specify their execution order
[command_buffers[i] enqueue];
+
+ encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
}
// TODO: is this the best way to start threads?
@@ -543,7 +546,7 @@ void ggml_metal_graph_compute(
id<MTLCommandBuffer> command_buffer = command_buffers[cb_idx];
- id<MTLComputeCommandEncoder> encoder = [command_buffer computeCommandEncoderWithDescriptor: edesc];
+ id<MTLComputeCommandEncoder> encoder = encoders[cb_idx];
const int node_start = (cb_idx + 0) * n_nodes_per_cb;
const int node_end = MIN((cb_idx == n_cb - 1) ? n_nodes : (cb_idx + 1) * n_nodes_per_cb, n_nodes);
@@ -1109,7 +1112,6 @@ void ggml_metal_graph_compute(
if (encoder != nil) {
[encoder endEncoding];
- encoder = nil;
}
[command_buffer commit];
@@ -1122,7 +1124,7 @@ void ggml_metal_graph_compute(
[command_buffers[n_cb - 1] waitUntilCompleted];
// release resources
- [queue release];
+ dispatch_release(queue);
// check status of command buffers
// needed to detect if the device ran out-of-memory for example (#1881)
@@ -1133,8 +1135,10 @@ void ggml_metal_graph_compute(
GGML_ASSERT(false);
}
+ [encoders[i] release];
[command_buffers[i] release];
}
+ [encoders release];
[command_buffers release];
}

I believe you need this. With this, for 13b q6_k, it stays at around 26M.
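Two memory-management details are at play in the diff above. Here is a hedged sketch of both, assuming ggml-metal.m is compiled without ARC (which the explicit release calls in the diffs suggest):

```objc
// 1) Metal factory methods return autoreleased objects. Without a pool
//    drain, repeated calls accumulate objects until some outer pool pops.
id<MTLCommandBuffer> cb = [ctx->queue commandBuffer];            // autoreleased
id<MTLComputeCommandEncoder> enc =
    [cb computeCommandEncoderWithDescriptor:edesc];              // autoreleased

// 2) GCD objects are plain C objects when Objective-C GCD objects are
//    disabled (OS_OBJECT_USE_OBJC == 0); then they must be balanced with
//    dispatch_release, not an Objective-C release message.
dispatch_queue_t q = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
// ... use q ...
dispatch_release(q);
```

This is why the diff swaps `[queue release]` for `dispatch_release(queue)`: the queue is a GCD object, not (necessarily) an NSObject.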
Thanks @jxy! This seems to work
Which option to choose?
Usually I just run it through Xcode Instruments; there are two sets of things left.

diff --git a/ggml-metal.m b/ggml-metal.m
index 45c3a1f..27383cc 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -233,9 +233,62 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
void ggml_metal_free(struct ggml_metal_context * ctx) {
fprintf(stderr, "%s: deallocating\n", __func__);
+#define GGML_METAL_DEL_KERNEL(name) \
+ [ctx->function_##name release]; \
+ [ctx->pipeline_##name release];
+
+ GGML_METAL_DEL_KERNEL(add);
+ GGML_METAL_DEL_KERNEL(add_row);
+ GGML_METAL_DEL_KERNEL(mul);
+ GGML_METAL_DEL_KERNEL(mul_row);
+ GGML_METAL_DEL_KERNEL(scale);
+ GGML_METAL_DEL_KERNEL(silu);
+ GGML_METAL_DEL_KERNEL(relu);
+ GGML_METAL_DEL_KERNEL(gelu);
+ GGML_METAL_DEL_KERNEL(soft_max);
+ GGML_METAL_DEL_KERNEL(diag_mask_inf);
+ GGML_METAL_DEL_KERNEL(get_rows_f16);
+ GGML_METAL_DEL_KERNEL(get_rows_q4_0);
+ GGML_METAL_DEL_KERNEL(get_rows_q4_1);
+// GGML_METAL_DEL_KERNEL(get_rows_q8_0);
+ GGML_METAL_DEL_KERNEL(get_rows_q2_K);
+ GGML_METAL_DEL_KERNEL(get_rows_q3_K);
+ GGML_METAL_DEL_KERNEL(get_rows_q4_K);
+ GGML_METAL_DEL_KERNEL(get_rows_q5_K);
+ GGML_METAL_DEL_KERNEL(get_rows_q6_K);
+ GGML_METAL_DEL_KERNEL(rms_norm);
+ GGML_METAL_DEL_KERNEL(norm);
+ GGML_METAL_DEL_KERNEL(mul_mat_f16_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q4_0_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q4_1_f32);
+// GGML_METAL_DEL_KERNEL(mul_mat_q8_0_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q2_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q3_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q4_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q5_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mat_q6_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_f16_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q4_0_f32);
+// GGML_METAL_DEL_KERNEL(mul_mm_q8_0_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q4_1_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q2_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q3_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q4_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q5_K_f32);
+ GGML_METAL_DEL_KERNEL(mul_mm_q6_K_f32);
+ GGML_METAL_DEL_KERNEL(rope);
+ GGML_METAL_DEL_KERNEL(alibi_f32);
+ GGML_METAL_DEL_KERNEL(cpy_f32_f16);
+ GGML_METAL_DEL_KERNEL(cpy_f32_f32);
+ GGML_METAL_DEL_KERNEL(cpy_f16_f16);
+
+#undef GGML_METAL_DEL_KERNEL
for (int i = 0; i < ctx->n_buffers; ++i) {
[ctx->buffers[i].metal release];
}
+ [ctx->library release];
+ [ctx->queue release];
+ [ctx->device release];
free(ctx);
}
@@ -527,6 +580,8 @@ void ggml_metal_graph_compute(
command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
}
+ [edesc release];
+
// TODO: is this the best way to start threads?
dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
(force-pushed from 53dea11 to e778b10)
There is still a very small memory leak somewhere - not sure it is in the Obj-C code though |
Let me know if you observe any more leaks - I haven't tested long runs yet |
It's still leaking. I ran it for 9 hours and memory increased by 800 MB, using commit 9e2ec8e.
Any ideas? How do we fix this? |
This may be obvious, but the leak sanitizer may help, if you haven't tried that already.
(force-pushed from 9e2ec8e to de94ca3)
Tried running … I think we need to use a tiny 60M LLaMA to be able to generate large amounts of tokens without waiting for hours. I suspect the leak might not be related to Metal; C++ container allocation/deallocation could be fragmenting the memory in some way instead. Therefore we should first make sure that very long CPU (and probably CUDA) generation does not exhibit the same behaviour. But to do that in a meaningful way, we need a much smaller model to test with.

Edit: nvm, it seems definitely something related to Metal. Can't figure out the cause.
Running with 1 CPU thread (i.e. one command buffer) seems to reduce the speed of the leak, so it makes me think it is something in the … Could it be …?
See ggml-org/whisper.cpp#1202 (comment). We should try this here too.
It is still leaking by wrapping with @autoreleasepool alone. After that, I also tried to reuse the dispatch queue:

diff --git a/ggml-metal.m b/ggml-metal.m
index e825b63..60141f2 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -39,6 +39,8 @@
id<MTLCommandQueue> queue;
id<MTLLibrary> library;
+ dispatch_queue_t d_queue;
+
int n_buffers;
struct ggml_metal_buffer buffers[GGML_METAL_MAX_BUFFERS];
@@ -120,6 +122,7 @@ @implementation GGMLMetalClass
ctx->n_buffers = 0;
ctx->concur_list_len = 0;
+ ctx->d_queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
#if 0
// compile from source string and show compile log
@@ -297,6 +300,7 @@ void ggml_metal_free(struct ggml_metal_context * ctx) {
[ctx->library release];
[ctx->queue release];
[ctx->device release];
+ dispatch_release(ctx->d_queue);
free(ctx);
}
@@ -563,6 +567,8 @@ void ggml_metal_graph_compute(
struct ggml_cgraph * gf) {
metal_printf("%s: evaluating graph\n", __func__);
+ @autoreleasepool {
+
// if there is ctx->concur_list, dispatch concurrently
// else fallback to serial dispatch
MTLComputePassDescriptor * edesc = MTLComputePassDescriptor.computePassDescriptor;
@@ -589,13 +595,11 @@ void ggml_metal_graph_compute(
command_encoders[i] = [command_buffers[i] computeCommandEncoderWithDescriptor: edesc];
}
- // TODO: is this the best way to start threads?
- dispatch_queue_t queue = dispatch_queue_create("llama.cpp", DISPATCH_QUEUE_CONCURRENT);
for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
const int n_nodes_per_cb = (n_nodes + n_cb - 1) / n_cb;
- dispatch_async(queue, ^{
+ dispatch_async(ctx->d_queue, ^{
size_t offs_src0 = 0;
size_t offs_src1 = 0;
size_t offs_dst = 0;
@@ -1175,7 +1179,7 @@ void ggml_metal_graph_compute(
}
// wait for all threads to finish
- dispatch_barrier_sync(queue, ^{});
+ dispatch_barrier_sync(ctx->d_queue, ^{});
// check status of command buffers
// needed to detect if the device ran out-of-memory for example (#1881)
@@ -1188,14 +1192,7 @@ void ggml_metal_graph_compute(
GGML_ASSERT(false);
}
- [command_encoders[i] release];
- [command_buffers[i] release];
}
- // release resources
- [edesc release];
- [queue release];
-
- [command_encoders release];
- [command_buffers release];
+ }
}

But I'm still not clear on the main cause yet.
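The @autoreleasepool block in the diff above matters because, without one, autoreleased Metal objects created during each graph evaluation only die when some outer pool drains, which may be never on a plain C call stack. A minimal sketch of the pattern (not the PR's exact code):

```objc
// Sketch: drain autoreleased Metal objects once per graph evaluation.
void ggml_metal_graph_compute(
        struct ggml_metal_context * ctx,
        struct ggml_cgraph * gf) {
    @autoreleasepool {
        // command buffers and encoders created here are autoreleased ...
    } // ... and get released when the pool drains at the end of each call
}
```

With the pool scoped per call, long-running generation no longer accumulates per-evaluation objects.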
(force-pushed from de94ca3 to 43a8a62)
I think the leak speed is proportional to the number of threads.

Note that when using Metal, the threads play no role in the computation - they are just used to create that many command buffers in parallel, which in practice does not offer any benefit, but the Apple docs say it is good practice to do it.
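A sketch of the multi-command-buffer pattern described above, using the names that appear in the diffs earlier in this thread (ctx->queue, ctx->d_queue, n_cb); treat it as an illustration, not the PR's exact code:

```objc
// One command buffer per "thread"; enqueue order fixes GPU execution order.
for (int i = 0; i < n_cb; ++i) {
    command_buffers[i] = [ctx->queue commandBuffer];
    [command_buffers[i] enqueue];
}

// Encoding happens in parallel on a concurrent dispatch queue,
// but the GPU still executes the buffers in the enqueued order.
for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
    dispatch_async(ctx->d_queue, ^{
        // encode this buffer's slice of graph nodes, then:
        [command_buffers[cb_idx] commit];
    });
}

// Wait for all encoding blocks to finish.
dispatch_barrier_sync(ctx->d_queue, ^{});
```

This explains why the leak rate scaled with the thread count: each extra thread meant one more per-call command buffer and encoder being created (and, before the fix, never drained).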
You are right, it seems that after a few minutes of generation it settles at a constant memory usage, so there is no longer a leak. I've made another small change to reuse the arrays for the command buffers and encoders, which is probably not necessary, but I think it is better like this. I've also set up tinyllama 15M from the …
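The "reuse the arrays" change mentioned above could look roughly like this (field names are hypothetical; the actual PR may differ):

```objc
// Sketch: keep the containers in the context instead of reallocating
// them on every ggml_metal_graph_compute call.
if (ctx->command_buffers == nil) {
    ctx->command_buffers  = [[NSMutableArray alloc] initWithCapacity:n_cb];
    ctx->command_encoders = [[NSMutableArray alloc] initWithCapacity:n_cb];
} else {
    [ctx->command_buffers removeAllObjects];
    [ctx->command_encoders removeAllObjects];
}
```

Since the arrays are owned by the context, they are released once in ggml_metal_free rather than created and destroyed on every evaluation.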
* metal : fix memory leak
* metal : fix encoders memory leak
* metal : clean up more memory resources
* metal : fix more leaks
* metal : reuse dispatch queue + autoreleasepool
* metal : reuse array for command buffers and encoders
* ggml : assert for odd number of blocks on ARM

15M tinyllama is an example
Closes #2761
Should fix the memory increase observed when using Metal
cc @li-plus