Preliminary results show that `llama.cpp` is 1.5x-2x slower than `llama-rs`. Both were checked to compile with the same arch flags and use the same GNU toolchain.

Summary (on Vicuna 13B, 2048 ctx size, 256 predicted tokens):

- `llama.cpp`: 430.44 ms per run
- `llama-rs`: per_token_duration: 272.793 ms
An interesting observation is that CPU utilization is lower with llama-rs.
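
For context on how numbers like these are obtained: per-token latency is just wall-clock time over the predict loop divided by the token count. A minimal sketch, assuming a hypothetical `eval_token` stand-in for the actual decode call (this is not the real API of either project):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for one decode step (e.g. the eval call in
// llama.cpp or llama-rs); NOT the real API of either project.
static void eval_token(int /*token*/) {
    // a real benchmark would run the model's forward pass here
}

int main() {
    const int n_predict = 256; // matches the benchmark's predict count

    const auto t_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_predict; ++i) {
        eval_token(i);
    }
    const auto t_end = std::chrono::steady_clock::now();

    const double total_ms =
        std::chrono::duration<double, std::milli>(t_end - t_start).count();
    std::printf("%.3f ms per token\n", total_ms / n_predict);
}
```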
System Info:
`llama.cpp`:

```
> make
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC: cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX: g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

> ./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
```
`llama-rs`:

```
warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3
No BLAS.
```
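
Since both builds report the same SIMD features, one independent cross-check is the compiler's CPU-detection builtins (GCC/Clang only; this snippet is unrelated to either project's code and is just a sketch):

```cpp
#include <cstdio>

// Prints the same SIMD features that the system_info lines above report,
// using GCC/Clang CPU-detection builtins, to confirm both toolchains
// see identical instruction-set support on this machine.
int main() {
    __builtin_cpu_init();
    std::printf("AVX    = %d\n", __builtin_cpu_supports("avx"));
    std::printf("AVX2   = %d\n", __builtin_cpu_supports("avx2"));
    std::printf("AVX512 = %d\n", __builtin_cpu_supports("avx512f"));
    std::printf("FMA    = %d\n", __builtin_cpu_supports("fma"));
    std::printf("F16C   = %d\n", __builtin_cpu_supports("f16c"));
    std::printf("SSE3   = %d\n", __builtin_cpu_supports("sse3"));
}
```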
Note: the llama-rs bench was run on my branch.