Something weird is going on with -ngl #12
Comments
lol that looks strange. I'll look into it :)
It took a bit longer; my fridge broke down. Note: on low-VRAM cards like yours you should use "-b 1". That will allow a lot more layers to be offloaded. Please let me know if everything works for your card now; I only simulated it.
That "no" at the start confused me for a while, but I'm assuming it's supposed to be "on"; in other words, I should be using "-b 1". Your changes seem to have helped, but there's still some really weird behavior. If I use
Obviously it didn't really offload 60 layers + the output layer.
That's not 100% crazy but I'm really skeptical that it actually offloaded 5 layers.
I'm sure it didn't offload 6 layers. Assuming you're calculating stuff ahead of time instead of just allocating until it fails, probably the simplest way would be to calculate the maximum possible number of offloaded layers and clamp the requested value to that, something like the sketch below. (I haven't looked at the actual code, so this suggestion may or may not be super dumb.)
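A rough sketch of what I mean, in C++ with made-up names (free_vram_bytes, overhead_bytes, bytes_per_layer and the sizes in main are placeholders for illustration, not the actual variables or numbers in the code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Sketch of the "calculate the maximum up front" idea: work out how many whole
// layers fit into the VRAM that is actually usable, then clamp the requested
// -ngl value to that. Every name and number here is a placeholder.
static int max_offloadable_layers(int64_t free_vram_bytes,
                                  int64_t overhead_bytes,
                                  int64_t bytes_per_layer,
                                  int     n_layer) {
    int64_t usable = free_vram_bytes - overhead_bytes;   // keep room for temporary tensors
    if (usable <= 0 || bytes_per_layer <= 0) return 0;
    int64_t fit = usable / bytes_per_layer;              // whole layers that fit
    return static_cast<int>(std::min<int64_t>(fit, n_layer));
}

int main() {
    const int n_layer   = 60;                            // Falcon-40B has 60 layers
    const int requested = 100;                           // e.g. the user passed -ngl 100
    const int max_layers = max_offloadable_layers(
        6LL * 1024 * 1024 * 1024,                        // pretend 6 GB of free VRAM
        512LL * 1024 * 1024,                             // pretend 512 MB of overhead
        446LL * 1024 * 1024,                             // pretend ~446 MB of weights per layer
        n_layer);
    const int n_gpu_layers = std::min(requested, max_layers);
    std::printf("offloading %d of %d layers to GPU\n", n_gpu_layers, n_layer);
    return 0;
}
```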
It's a display bug; I just pushed another commit. There are always so many cases to test. I didn't think about testing ngl > 60 because it's not supported at the moment, and I wanted to change the behaviour once we do support it :)
vram_overhead is used for temporary tensors during evaluation.
Lastly:
No problem, just trying to be helpful. My GPU is such garbage that it's actually a performance loss to offload layers, so I was just playing with it for testing purposes (prompt ingestion is faster on GPU though, so I'd actually set
It's quite amazing how well falcon performs on CPU.
@KerfuffleV2 given your CPU focus, can you give the branch I've put into pull requests a test too?
Thanks for the suggestion. I've tried it (for llama.cpp at least), and it's a lot slower for prompt or perplexity calculations compared to GPU. From my experimentation, perplexity or prompt ingestion gets a massive speedup on GPU, but actual inference is slower. This wasn't the case before I upgraded my CPU, though: with my old Ryzen 5 1600, GPU offloading was an improvement, but it isn't with the 5900X.
I can't build it, unfortunately.
That's with clang, but I tried with gcc also. The machine is x86 Arch Linux. Also, unfortunately, something like that isn't really going to matter until #6 is fixed: #6 makes generating tokens around twice as slow by the 1000-token mark, and it keeps getting worse from what I saw. So Falcon can't be used with GGML at the moment for anything except really short generations.
I used It's the same though, I verified I'm on the right commit (not that I'd expect
And if you're still not convinced:
I screwed around with it a bit by commenting out those lines and adding
It didn't compile on Linux; my bad, it was late.
Thanks, that does fix the compile problem. I haven't had a chance to test it extensively yet, but performance seems at least as good as before and, like you mentioned, is less affected by the number of threads. Using the lowest thread count that reaches that performance is probably best, though, since it will basically consume 100% of whatever threads it has access to (spinning, I guess). 5-6 threads still seems to work best on my 5900X (12 cores, 24 threads, but sadly only DDR4 memory).
If no problems come up, I'll push it into master today or tomorrow. I think this is the first step towards better CPU thread utilization. In my case, with DDR5 and an I13, the best setting appears to be physical cores - 1 (with the new push it can be physical cores +/- 1).
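As a rough illustration of that heuristic, here is a minimal sketch assuming two logical threads per physical core (a hypothetical helper, not the project's actual thread-selection code):

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

// Sketch of a "physical cores minus one" default, assuming 2 logical threads
// per physical core (typical SMT). Purely illustrative; the project may pick
// its thread count differently.
static int default_thread_count() {
    unsigned logical = std::thread::hardware_concurrency();  // logical threads (may be 0 if unknown)
    if (logical == 0) return 1;
    unsigned physical = std::max(1u, logical / 2);            // assumes an SMT factor of 2
    return static_cast<int>(std::max(1u, physical - 1));      // leave one core for the OS
}

int main() {
    std::printf("suggested -t value: %d\n", default_thread_count());
    return 0;
}
```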
-ngl 0 - offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB
-ngl 1 - offloading 1 of 60 layers to GPU, weights offloaded 445.50 MB
-ngl 2 - offloading 2 of 60 layers to GPU, weights offloaded 891.00 MB
-ngl 3 -
Wait, what?
-ngl 4 -
-ngl 10 -
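(For illustration only: a hypothetical sketch of how a summary line like the ones above could be printed from a clamped layer count, which is roughly the kind of display fix mentioned earlier in the thread; the helper name and numbers are placeholders.)

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical illustration: derive the summary line from a clamped layer
// count instead of the raw -ngl argument, so it can never claim more layers
// than the model has. The numbers below just mirror the report above.
static void print_offload_summary(int requested_ngl, int n_layer, double mb_per_layer) {
    int offloaded = std::clamp(requested_ngl, 0, n_layer);
    std::printf("offloading %d of %d layers to GPU, weights offloaded %.2f MB\n",
                offloaded, n_layer, offloaded * mb_per_layer);
}

int main() {
    print_offload_summary(2,   60, 445.50);   // -ngl 2  -> 891.00 MB, as in the report
    print_offload_summary(100, 60, 445.50);   // -ngl > 60 gets clamped in the output
    return 0;
}
```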