Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but they run at a fine speed with Dalai (which uses an older version of llama.cpp) #767
Expected Behavior
I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).
Current Behavior
When I load a 13B model with llama.cpp (like Alpaca 13B or other models based on it) and try to generate some text, each token takes several seconds to produce, to the point that these models are unusably slow. However, the same models run at a reasonable speed with Dalai, which uses an older version of llama.cpp. An example of the kind of command I run is shown below.
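For reproduction, this is roughly the command I use (a sketch only: the model path, prompt, and thread count are placeholders, not my exact values):

# Hypothetical reproduction command; paths and prompt are placeholders.
# -m: quantized 13B model file, -p: prompt, -n: tokens to generate, -t: thread count
./main -m ./models/alpaca-13b/ggml-model-q4_0.bin \
       -p "Tell me about llamas." \
       -n 128 -t 8

With a command along these lines, every generated token takes several seconds on the current llama.cpp build, while the same model responds quickly through Dalai.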
Environment and Context
MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.
Python 3.9.16
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
If you need any logs or other information, I will post everything you need. Thanks in advance.