
warning: failed to mlock NNNNNN-byte buffer (after previously locking 0 bytes): Cannot allocate memory #254


Closed
AnonymousAmalgrams opened this issue May 21, 2023 · 11 comments
Labels: documentation

Comments

AnonymousAmalgrams commented May 21, 2023

I'm getting the following output when running the web server from the git clone:

llama.cpp: loading model from ./vendor/llama.cpp/models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4017.34 MB
llama_model_load_internal: mem required  = 5809.34 MB (+ 17592185987986.00 MB per state)
warning: failed to mlock 4212486144-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).

I manually built the libllama.so file and dropped it into the directory where the bindings look for it. I tried building it both following #30 and with `make libllama.so`; both, maybe as expected, give the same result. Oddly enough, though, the pip install seems to work fine (not sure what it's doing differently) and reports the same "normal" ctx size (around 70 KB) as running the model directly within vendor/llama.cpp with the `-n 128` suggested for testing. Any suggestions for how to get a working libllama.so would be greatly appreciated.

gjmulder added the enhancement and question labels on May 21, 2023
gjmulder changed the title from "Massive ctx?" to "warning: failed to mlock NNNNNN-byte buffer (after previously locking 0 bytes): Cannot allocate memory" on May 21, 2023

gjmulder commented May 21, 2023

Ahh, saw the error. If you format your screen output and refer to the actual error in your issue description as per the template, it helps people understand your issue more easily.

Here's the fix, which is not directly related to n_ctx. With mlock enabled you are hitting the default mlock memory limit for your Linux distro:

`ulimit -l unlimited && python3 -m llama_cpp.server`

There is also a system-wide limits file (commonly /etc/security/limits.conf) whose exact location varies by distro.
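For example, something along these lines raises the limit persistently (a sketch only; `<user>` is a placeholder for the account running the server, and the exact file and behaviour vary by distro, so double-check yours):

```
# /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/)
# <user> is whatever account runs the server.
<user>  soft  memlock  unlimited
<user>  hard  memlock  unlimited
```

After logging back in, `ulimit -l` should report `unlimited` and the mlock warning should go away.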

@gjmulder
Contributor

#171

gjmulder added the duplicate and documentation labels and removed the question, enhancement, and duplicate labels on May 21, 2023
@AnonymousAmalgrams
Author

> Ahh, saw the error. If you format your screen output and refer to the actual error in your issue description as per the template, it helps people understand your issue more easily.
>
> Here's the fix, which is not directly related to n_ctx. With mlock enabled you are hitting the default mlock memory limit for your Linux distro:
>
> `ulimit -l unlimited && python3 -m llama_cpp.server`
>
> There is also a system-wide limits file (commonly /etc/security/limits.conf) whose exact location varies by distro.

In the issue mentioned, they pasted an image of the output and still have a ctx size of around 70 KB and a correspondingly much smaller mem required than "+ 17592185987986.00 MB per state". That lines up with the difference I observed between running the llama.cpp version on its own and the parameters that were somehow baked into the .so. If I'm interpreting that correctly, I don't think I would ever be able to get enough memory to run this even if I disabled mlock, and I'd worry for my computer if I tried 😅.


gjmulder commented May 21, 2023

> In the issue mentioned, they pasted an image of the output and still have a ctx size of around 70 KB and a correspondingly much smaller mem required than "+ 17592185987986.00 MB per state". That lines up with the difference I observed between running the llama.cpp version on its own and the parameters that were somehow baked into the .so. If I'm interpreting that correctly, I don't think I would ever be able to get enough memory to run this even if I disabled mlock, and I'd worry for my computer if I tried 😅.

17592185987986.00 MB (17.6 exabytes) is clearly a bug. mlock just forces the memory allocated to llama.cpp not to be swapped out. This is approximately how much memory you need:

| Model | Original size | Quantized size (4-bit) |
| ----- | ------------- | ---------------------- |
| 7B    | 13 GB         | 4 GB                   |
| 13B   | 24 GB         | 8 GB                   |
| 30B   | 60 GB         | 20 GB                  |
| 65B   | 120 GB        | 39 GB                  |

From the llama.cpp CPU memory / disk requirements.

You can adjust n_gpu_layers to fill your GPU with as many layers as you have VRAM available.
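If you're using the Python API directly, a minimal sketch looks something like this (the model path and layer count are placeholders to adjust for your hardware, and it assumes your libllama.so was built with GPU/cuBLAS support):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./vendor/llama.cpp/models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=20,   # placeholder: raise/lower to fit your available VRAM
    use_mlock=False,   # or True once the memlock limit above is raised
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```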

@AnonymousAmalgrams
Author

Thanks, so as I understand it the n_gpu_layers will limit the amount of swap used to the associated VRAM amount? Just trying to make sure I'm not about to blow anything out.

@AnonymousAmalgrams
Author

> > In the issue mentioned, they pasted an image of the output and still have a ctx size of around 70 KB and a correspondingly much smaller mem required than "+ 17592185987986.00 MB per state". That lines up with the difference I observed between running the llama.cpp version on its own and the parameters that were somehow baked into the .so. If I'm interpreting that correctly, I don't think I would ever be able to get enough memory to run this even if I disabled mlock, and I'd worry for my computer if I tried 😅.
>
> 17592185987986.00 MB (17.6 exabytes) is clearly a bug. mlock just forces the memory allocated to llama.cpp not to be swapped out. This is approximately how much memory you need:
>
> | Model | Original size | Quantized size (4-bit) |
> | ----- | ------------- | ---------------------- |
> | 7B    | 13 GB         | 4 GB                   |
> | 13B   | 24 GB         | 8 GB                   |
> | 30B   | 60 GB         | 20 GB                  |
> | 65B   | 120 GB        | 39 GB                  |
>
> From the llama.cpp CPU memory / disk requirements.
>
> You can adjust n_gpu_layers to fill your GPU with as many layers as you have VRAM available.

Actually, now that I think of it, isn't it kind of odd that swap isn't needed in some cases then? I ran all my testing on the same system, so my impression is that the same swap limits would be imposed.

@gjmulder
Contributor

Generally you don't want the OS to swap the model out, but it may try to given the large memory footprint. mlock can be used to explicitly request the OS not to swap it out. On a Linux Mint system I saw the stupid OOM Killer preemptively killing llama.cpp, even though the system had enough physical RAM.

The VRAM usage is AFAIK independent of the mlock behaviour. In my experience the n_ctx size uses a set amount of VRAM. Whatever is left over can be allocated to n_gpu_layers. You'll want to leave some MB free for minor VRAM growth, although #223 indicates that there's a bug in llama.cpp where VRAM isn't freed even if you force python to garbage collect the model instance.
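(For reference, "force python to garbage collect the model instance" means something like the sketch below; per the bug referenced above, the VRAM backing the model may still not be returned.)

```python
import gc
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_gpu_layers=20)  # placeholders
# ... use the model ...

del llm        # drop the only reference to the model instance
gc.collect()   # force a collection pass; VRAM may still not be freed (see #223)
```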

@gdedrouas

The llama_context struct changed in llama.cpp; updating your llama-cpp-python to the current main should fix it.
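Roughly, assuming a git checkout with the vendored llama.cpp submodule (adjust for a plain pip install):

```sh
# Pull the bindings and the vendored llama.cpp forward together so the
# llama_context layout in libllama.so matches what the Python code expects.
git pull
git submodule update --init --recursive
pip install -e . --force-reinstall --no-cache-dir

# Or, for a pip-managed install:
pip install --upgrade --force-reinstall llama-cpp-python
```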

@AnonymousAmalgrams
Author

Okay, I managed to get it working now. Thanks again!

xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue on Jun 13, 2023: "This causes long prompts to parse very slowly."
@pavelklymenko

> Okay, I managed to get it working now. Thanks again!

@AnonymousAmalgrams What did you do?

@AnonymousAmalgrams
Author

I just updated the repo… but there are probably a lot of other random things that can cause this problem.
