Rapid memory leak (2MB/s) using Metal backend #2761

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.

# Expected Behavior

No memory leak while running the official example `main` program.

# Current Behavior

I compiled the latest llama.cpp on an Apple M2 Ultra with the Metal backend enabled:
```sh
make LLAMA_METAL=1 -j
```

Then I ran the LLaMA 7B q4_0 model in an endless loop:
```sh
yes | ./main -m ./models/7B/ggml-model-q4_0.gguf -n 512 -ngl 32 -i
```

I used `top` to monitor the memory used by this program. It grows by roughly 2 MB per second and shows no sign of stopping.
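
For anyone trying to reproduce the number without watching `top` by hand, here is a minimal sketch that samples the resident set size with `ps` and prints the average growth. The helper names `rss_kb` and `growth_rate` are made up for this report and are not part of llama.cpp. RSS is only a proxy, so the figure may differ somewhat from the column `top` displays.

```sh
#!/bin/sh
# Hypothetical helpers (not part of llama.cpp): estimate how fast a
# process's resident set size (RSS) grows, using only `ps` and `sleep`.

# rss_kb PID -> prints the current RSS of PID in KiB
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# growth_rate PID SECONDS -> prints the average RSS growth in KiB/s
# over a window of SECONDS; returns 1 if the process cannot be sampled.
growth_rate() {
    pid=$1
    secs=$2
    start=$(rss_kb "$pid")
    sleep "$secs"
    end=$(rss_kb "$pid")
    # Bail out if the process was missing at either sample.
    if [ -z "$start" ] || [ -z "$end" ]; then
        return 1
    fi
    echo $(( (end - start) / secs ))
}
```

With `main` already running in another terminal, `growth_rate "$(pgrep -n main)" 60` averages over one minute. A leak of about 2 MB/s should show up as a value in the neighbourhood of 2000 KiB/s.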

I tried adding an `@autoreleasepool` block in `ggml_metal_graph_compute`. The leak slows down but does not go away. I am not fluent in Objective-C or Metal. Can anybody help with this?

# Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

* Physical (or virtual) hardware you are using, e.g. for Linux:

```
$ sysctl -a | grep cpu
kern.sched_rt_avoid_cpu0: 0
kern.cpu_checkin_interval: 4000
hw.ncpu: 24
hw.activecpu: 24
hw.perflevel0.physicalcpu: 16
hw.perflevel0.physicalcpu_max: 16
hw.perflevel0.logicalcpu: 16
hw.perflevel0.logicalcpu_max: 16
hw.perflevel0.cpusperl2: 4
hw.perflevel1.physicalcpu: 8
hw.perflevel1.physicalcpu_max: 8
hw.perflevel1.logicalcpu: 8
hw.perflevel1.logicalcpu_max: 8
hw.perflevel1.cpusperl2: 4
hw.physicalcpu: 24
hw.physicalcpu_max: 24
hw.logicalcpu: 24
hw.logicalcpu_max: 24
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: -634136515
hw.cpusubfamily: 5
machdep.cpu.cores_per_package: 24
machdep.cpu.core_count: 24
machdep.cpu.logical_per_package: 24
machdep.cpu.thread_count: 24
machdep.cpu.brand_string: Apple M2 Ultra
```

* Operating System, e.g. for Linux:

```
$ uname -a     
Darwin n79-169-4 22.5.0 Darwin Kernel Version 22.5.0: Thu Jun  8 22:29:35 PDT 2023; root:xnu-8796.121.3~8/RELEASE_ARM64_T6020 arm64
```

* SDK version, e.g. for Linux:

```
$ python3 --version                                             
Python 3.11.4

$ make --version | head -n 1
GNU Make 3.81

$ clang --version | head -n 1
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
```

