
perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower #932

@jon-chuang

Description


Preliminary results show that llama.cpp is 1.5x-2x slower than llama-rs. Both were verified to compile with the same architecture flags and to use the same GNU toolchain.

Summary (on Vicuna 13B, 2048 ctx size, 256 predict tokens):

  • llama.cpp: 430.44 ms per run
  • llama-rs: 272.793 ms per token (per_token_duration)
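As a sanity check on the headline claim, the ratio implied by the two timings above (treating both as per-token figures, which is an assumption for the llama.cpp number since its summary reports "per run"):

```python
# Timings reported above, in milliseconds
llama_cpp_ms = 430.44
llama_rs_ms = 272.793

# Slowdown factor of llama.cpp relative to llama-rs
ratio = llama_cpp_ms / llama_rs_ms
print(f"{ratio:.2f}x")  # ~1.58x, consistent with the reported 1.5x-2x range
```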

Detailed results

An interesting observation is that CPU util is lower on llama-rs.

System Info:

llama.cpp

> make
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
I CXX:      g++ (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0

./main
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama-rs

warning: Using gnu
warning: Using MAVX
warning: Using AVX2
warning: Using FMA
warning: Using F16C
warning: Using SSE3

No BLAS.

Note: the llama-rs benchmarks were run on my branch.
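For context, a sketch of how the llama.cpp side of the comparison can be reproduced. The model path and prompt are placeholders (assumptions, not from this report); `-c`, `-n`, and `-t` match the ctx size, predict count, and thread count shown above:

```shell
# Build with the Makefile defaults reported above (-O3 -march=native -mtune=native)
make

# Benchmark run: 2048 ctx, 256 predicted tokens, 16 threads (as in system_info).
# The model path below is a placeholder; point -m at your Vicuna 13B ggml file.
./main -m ./models/vicuna-13b/ggml-model-q4_0.bin \
  -c 2048 -n 256 -t 16 \
  -p "Building a website can be done in 10 simple steps:"
```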
