Something weird is going on with -ngl #12

Closed
KerfuffleV2 opened this issue Jun 19, 2023 · 12 comments

@KerfuffleV2
Collaborator

-ngl 0 - offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB

-ngl 1 - offloading 1 of 60 layers to GPU, weights offloaded 445.50 MB

-ngl 2 - offloading 2 of 60 layers to GPU, weights offloaded 891.00 MB

-ngl 3 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4825.00 MB  of 6050.00 MB (in use: 1224.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 59 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3487.00 MB  of 6050.00 MB (used: 2562.00 MB)

Wait, what?

-ngl 4 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4805.00 MB  of 6050.00 MB (in use: 1244.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 58 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 59 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3467.00 MB  of 6050.00 MB (used: 2582.00 MB)

-ngl 10 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4828.00 MB  of 6050.00 MB (in use: 1221.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 52 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 53 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3490.00 MB  of 6050.00 MB (used: 2559.00 MB)
@cmp-nct
Owner

cmp-nct commented Jun 19, 2023

lol, that looks strange. I'll look into it :)

@cmp-nct
Owner

cmp-nct commented Jun 19, 2023

It took a bit longer; my fridge broke down.
I just pushed a bugfix; it should be good now.

Note: on low VRAM cards like yours you should use "-b 1". That will allow a lot more layers to be offloaded.

Please let me know if it all works for your card now; I only simulated it.

@KerfuffleV2
Collaborator Author

Note: no low VRAM cards like yours you should use "-b 1". That will allow a lot more layers to be offloaded.

That "no" at the start confused me for a while, but I'm assuming it's supposed to be "on" - in other words, I should be using -b 1. Correct?

Your changes seem to have helped, but there's still some really weird behavior. If I use -ngl 1000 (you can use that with llama.cpp, at least, to try to offload all layers):

INFO: Not enough VRAM to load all requested layers - at layer 5 of 60: skipping
falcon_model_load_internal: mem required  = 28347.20 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 2673.00 MB
falcon_model_load_internal: offloading output layer to GPU
falcon_model_load_internal: estimated VRAM usage: 3923 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 2070.00 MB  of 6050.00 MB (used: 3979.00 MB)

Obviously it didn't really offload 60 layers + the output layer.

-ngl 60

falcon_model_load_internal: VRAM free: 4763.00 MB  of 6050.00 MB (in use: 1286.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 5 of 60: skipping
falcon_model_load_internal: mem required  = 28347.20 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 5 of 60 layers to GPU, weights offloaded 2673.00 MB
falcon_model_load_internal: estimated VRAM usage: 3923 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 2087.00 MB  of 6050.00 MB (used: 3962.00 MB)

That's not 100% crazy but I'm really skeptical that it actually offloaded 5 layers.

-ngl 61

falcon_model_load_internal: VRAM free: 4763.00 MB  of 6050.00 MB (in use: 1286.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 5 of 60: skipping
falcon_model_load_internal: mem required  = 28347.20 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 6 of 60 layers to GPU, weights offloaded 2673.00 MB
falcon_model_load_internal: estimated VRAM usage: 3923 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 2087.00 MB  of 6050.00 MB (used: 3962.00 MB)

I'm sure it didn't offload 6 layers. (-ngl 65 claims to be offloading 10 layers - not believable.) Also, it doesn't make sense that specifying different -ngl values that are way past the number of layers that can actually be offloaded should change anything. So something is weird with the calculation.

Assuming you're calculating things ahead of time instead of just allocating until it fails, probably the simplest approach would be to compute the maximum possible number of offloadable layers and take MIN(ngl_value, max_possible_offload_layers), or something like that.

(I haven't looked at the actual code so this suggestion may or may not be super dumb.)
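To illustrate what I mean, a throwaway sketch (the names here are made up, not taken from libfalcon.cpp):

/* Throwaway sketch of the clamping idea above; the names are invented
   and not from libfalcon.cpp. */
static int clamp_offload_layers(int ngl_requested, int max_offloadable) {
    /* Never report or attempt more layers than VRAM can actually hold. */
    return ngl_requested < max_offloadable ? ngl_requested : max_offloadable;
}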

@cmp-nct
Owner

cmp-nct commented Jun 19, 2023

It's a display bug; I just pushed another commit. There are always so many cases to test. I didn't think about testing ngl > 60 because it's not supported at the moment, and I wanted to change the behaviour once we support it :)

  1. It is possible to specify > n_layers as -ngl in llama.cpp, but it's disabled (not working yet) on falcon.
    n_layers + 1 = output tensor
    n_layers + 2 = normalization tensor
    The calculations for those aren't working yet, so it stays disabled.

  2. The "offloading X of X" message is correct; you should also notice a performance boost.
    For 6 GB cards I believe there is a bit of headroom if you want to optimize it.
    vram_overhead and vram_reserved in libfalcon.cpp are both larger than they need to be,
    so on a small-VRAM card you can try reducing them a bit until you notice a performance hit or unreliable performance.

vram_overhead is used for temporary tensors during evaluation.
vram_reserved is used for system memory.
I set both a bit more generously than needed because I did have some cases which required it; if they are too small, sudden performance loss can happen during evaluation, up to a full stall.
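For illustration, this is the kind of thing to look for (the values here are placeholders, check the actual defaults in libfalcon.cpp before changing anything):

#include <stddef.h>

/* Placeholder values for illustration only - the real defaults in libfalcon.cpp may differ.
   Lowering them frees VRAM for more layers, but going too low can cause slowdowns or a
   full stall during evaluation. */
static const size_t vram_overhead = 1200u * 1024u * 1024u;  /* kept free for temporary eval tensors */
static const size_t vram_reserved =  512u * 1024u * 1024u;  /* kept free for the system/driver      */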

  3. The recommended way of using falcon is with -ngl 60.
    It automatically offloads as many tensors as your VRAM permits, so you can use -ngl 60 for every model; it should always adapt.

Lastly:
Regarding calculating ahead: I wanted to do that but settled for a simpler approach. It calculates one layer ahead, i.e. it checks whether the next layer will fit based on the memory requirements of the current layer.
If the next layer will not fit (taking real free VRAM, overhead, and reserve into account), it stops offloading.
In that case it will not GPU-accelerate the "last N" tensors; it ends up accelerating some tensors in between instead.
I'm not sure why Johannes chose to accelerate the last N tensors; it makes things more complicated, but I didn't want to change that without knowing the reasoning.
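A rough sketch of that check (not the actual libfalcon.cpp code; the names are illustrative):

/* Rough sketch of the one-layer-lookahead check described above; this is
   not the actual libfalcon.cpp code and the names are illustrative. */
#include <stdbool.h>
#include <stddef.h>

static bool next_layer_fits(size_t vram_free_now,       /* real free VRAM right now        */
                            size_t vram_overhead,       /* kept for temporary eval tensors */
                            size_t vram_reserved,       /* kept for the system/driver      */
                            size_t current_layer_bytes) /* next layer assumed ~= current   */
{
    /* Stop offloading once the next layer would no longer fit into what is
       left after overhead and reserve are subtracted from free VRAM. */
    if (vram_free_now <= vram_overhead + vram_reserved) {
        return false;
    }
    return current_layer_bytes <= vram_free_now - vram_overhead - vram_reserved;
}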

@KerfuffleV2
Collaborator Author

No problem, just trying to be helpful. My GPU is such garbage it's actually a performance loss to offload layers so I was just playing with it for testing purposes (prompt ingestion is faster on GPU though so I'd actually set -b pretty high and -ngl 0).

@cmp-nct
Owner

cmp-nct commented Jun 20, 2023

It's quite amazing how well falcon performs on CPU.
You could also give OpenBLAS a try if you haven't yet. It's known to speed up CPU prompt ingestion (it's useless for anything else, though).

@cmp-nct
Owner

cmp-nct commented Jun 20, 2023

@KerfuffleV2 given your CPU focus, can you give the branch I've put into the pull requests a test too?
I'm seeing a 9% performance improvement on 13th-gen Intel.

@KerfuffleV2
Collaborator Author

You could also give OpenBLAS a try, if you didn't yet.

Thanks for the suggestion. I've tried it (for llama.cpp at least) and it's a lot slower for prompt or perplexity calculations compared to GPU. From my experimentation, perplexity or prompt ingestion has a massive speedup on GPU but actual inference is slower. This wasn't the case before I upgraded my CPU though: I had a Ryzen 5 1600 - at that point, GPU offloading was an improvement but not with a 5900X.

given your CPU focus, can you give the branch I've put into pull requests a test too ?

I can't build it, unfortunately.

In file included from /home/nope/personal/ai/ggllm.cpp/ggml.c:89:
/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/include/stdatomic.h:40:23: error: conflicting type qualifiers for ‘atomic_bool’
   40 | typedef _Atomic _Bool atomic_bool;
      |                       ^~~~~~~~~~~
In file included from /home/nope/personal/ai/ggllm.cpp/ggml.c:4:
/home/nope/personal/ai/ggllm.cpp/ggml.h:224:24: note: previous declaration of ‘atomic_bool’ with type ‘atomic_bool’ {aka ‘volatile long int’}
  224 |     typedef atomic_int atomic_bool;
      |                        ^~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13.1.1/include/stdatomic.h:46:21: error: conflicting type qualifiers for ‘atomic_int’
   46 | typedef _Atomic int atomic_int;
      |                     ^~~~~~~~~~
/home/nope/personal/ai/ggllm.cpp/ggml.h:223:27: note: previous declaration of ‘atomic_int’ with type ‘atomic_int’ {aka ‘volatile long int’}
  223 |     typedef volatile LONG atomic_int;
      |     

That's clang, but I tried with gcc also. Machine is x86 Arch Linux.
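If I had to guess, the clash is that ggml.h declares the Windows-style fallback typedefs in a branch that also gets taken on Linux, where stdatomic.h already provides them. The usual kind of guard would look something like this (just a guess, not a tested fix):

/* Just a guess at the usual guard pattern, not a tested fix: only use the
   Windows-style fallback typedefs where C11 atomics are unavailable, and
   include <stdatomic.h> everywhere else. */
#if defined(_MSC_VER)
    #include <windows.h>
    typedef volatile LONG atomic_int;   /* LONG comes from <windows.h> */
    typedef atomic_int    atomic_bool;
#else
    #include <stdatomic.h>
#endif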

Also, unfortunately, a CPU speedup like that isn't really going to matter much until #6 is fixed. #6 makes generating tokens around twice as slow by the 1000-token mark, and it keeps getting worse from what I saw.

So Falcon can't be used with GGML at the moment for anything except really short generations.

@KerfuffleV2
Collaborator Author

How did you test it?

I used gh: gh pr checkout 16

It's the same though; I verified I'm on the right commit (not that I'd expect gh to fail at checking out a PR):

commit c09786bac5405a49ec8cf70d3db654c321f70617 (HEAD -> test-ggml-compute-chunks, origin/test-ggml-compute-chunks)
Author: John <[email protected]>
Date:   Tue Jun 20 04:01:09 2023 +0200

    test

commit dd80fb53201ec91dcf6cab987c9e8ed4a4457a9a
Author: John <[email protected]>
Date:   Tue Jun 20 04:00:34 2023 +0200

    chunked RMS and mulmat for testing

And if you're still not convinced:

95ac3ebf3ace1f47042ac6f8c0a69cb0235d21a45834759c08cd9a6f56b19864  ../ggml.c
7bb3b2753cd7cee63b236cc8a49aacd8a71c2103e8993ee8e78ad52785769286  ../ggml.h

I screwed around with it a bit by commenting out those lines and adding #include <stdatomic.h> to ggml.h but wasn't able to get it to compile. It just dies a bit later on trying to compile libfalcon. Might be because Arch has very up to date packages compared to a lot of distros.

@cmp-nct
Owner

cmp-nct commented Jun 20, 2023

It didn't compile on Linux; my bad, it was late.
I have to look into GitHub Actions to auto-compile.
I just pushed a commit. It's not the cleanest solution, but it compiles fine now.
I get the same speed improvement on WSL as on Windows.

@KerfuffleV2
Collaborator Author

Thanks, that does fix the compile problem. I haven't had a chance to test it extensively yet, but performance seems at least as good as before and, like you mentioned, is less affected by the number of threads. Using the lowest thread count that still reaches full performance is probably best, though, since it will basically consume 100% of whatever threads it has access to (spinning, I guess).

5-6 threads still seems to work best on my 5900X (12 cores, 24 threads but sadly only DDR4 memory).

@cmp-nct
Owner

cmp-nct commented Jun 20, 2023

If no problems come up I'll push it into master today or tomorrow. I think this is the first step toward better CPU thread utilization.
Especially on mixed-core CPUs this brings a significant boost, and I'm quite sure that better thread management could bring another 10% or more. Not sure if there is a downside for other CPUs.

In my case, with DDR5 and 13th-gen Intel, the best setting appears to be physical cores minus 1 (with the new push it can be physical cores ± 1).

@cmp-nct cmp-nct closed this as completed Jul 7, 2023