
Inferencing is Dead Slow #155


Closed
xdevfaheem opened this issue May 5, 2023 · 23 comments
Labels
duplicate, performance

Comments

@xdevfaheem

I'm using LangChain's llama-cpp integration to run an LLM. It's so slow that each token takes about 10-20 seconds to generate.

My code is simple:

from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp

model_path = hf_hub_download(repo_id="TheBloke/stable-vicuna-13B-GGML", filename="stable-vicuna-13B.ggml.q4_2.bin")
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=512,
    n_parts=-1,
    seed=1337,
    f16_kv=True,
    logits_all=False,
    vocab_only=False,
    use_mmap=True,
    use_mlock=False,
    n_threads=None,
    n_batch=512,
    temperature=1.0,
    max_tokens=256,
    top_p=0.90,
    top_k=40,
    streaming=True,
    last_n_tokens_size=64,
)
llm("Write a Python Code for Scraping The Given website")

I went through LangChain's llama-cpp integration source code and this repo's code to get a clear understanding of what was going on.

After the last line is executed:
it calls call()
which calls generate()
which calls _generate()
which calls _call()
which then calls the stream() function
which calls _create_completion()
which again calls generate()
which calls the sampling helper function
after which it calls eval()
finishing with llama_eval() and _lib.llama_eval()

Is there any way to decrease the latency of the model or increase the inference speed, keeping the above in mind? Or can we shrink the above function calls into one single pipeline?

Sorry if there are any typos, I'm in a rush!

See the bottom of the Colab screenshot:
[Screenshot from 2023-05-05 12-06-29]
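
(Aside: the wrapper layers described above can be bypassed by calling llama-cpp-python's Llama class directly; a minimal sketch, reusing the model_path from the snippet above and assuming 2 threads:)

from llama_cpp import Llama

# Load the GGML model directly, without the LangChain wrapper.
llm = Llama(model_path=model_path, n_ctx=512, n_threads=2)

# Stream the completion token by token.
for chunk in llm("Write Python code for scraping the given website",
                 max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)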

@xdevfaheem changed the title from "Inferencing is So Slow" to "Inferencing is Dead Slow" May 5, 2023
@xdevfaheem
Author

@abetlen Please take a look!

@SagsMug
Contributor

SagsMug commented May 5, 2023


How many cores do you have?
You can check with:

import multiprocessing
print(multiprocessing.cpu_count())

@xdevfaheem
Author


How many cores do you have? You can check with:

import multiprocessing
print(multiprocessing.cpu_count())

2

@SagsMug
Contributor

SagsMug commented May 5, 2023


Get yourself a machine with at least 8 cores, otherwise it's going to be dead slow.

If you want it to be a bit faster, set n_threads to 2, as the default behaviour is to use half the number of cores to account for hyperthreading.
Meaning you are currently only using one core.
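
(For example, a minimal sketch of the LangChain wrapper with the thread count set explicitly; other parameters as in the original snippet:)

from langchain.llms import LlamaCpp

# Use both physical cores instead of the cpu_count // 2 default (1 thread here).
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=512,
    n_threads=2,
    n_batch=512,
    streaming=True,
)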

@xdevfaheem
Author


Thanks for your response...

I think I should move to GPTQ instead of GGML.

@xdevfaheem
Author


How about modifying the function calls?

@abetlen
Owner

abetlen commented May 5, 2023

@TheFaheem yes I think someone else tried to run this in a colab notebook the other day and faced the same issue. I can look into it as well but it's likely just because of the low core count of the default machines there.

@xdevfaheem
Author


Why so many function calls? Why can't it be simplified to just _call(), create_completion(), generate(), eval() and llama_eval()?

@abetlen
Owner

abetlen commented May 5, 2023

@TheFaheem which functions are you referring to? The functions inside of llama_cpp.py are a mirror of llama.h in the llama.cpp project; exposing the full API is an explicit goal of this project.

@xdevfaheem
Author


in llama.py

@gjmulder
Contributor

gjmulder commented May 5, 2023

I am using the alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin model on a 16 (physical) core AMD server with an old 1080Ti GPU:

llama-cpp-python timings with an end-to-end (i.e. worst case) call time of 114 seconds as measured by a call to the server REST API:

2023-05-05 17:48:21,431 - INFO - Temp: 0.7, prompt: four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. we are met on a great battle field of that war. we come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. this we may, in all propriety do.
2023-05-05 17:50:15,732 - INFO - API response: but, in a larger sense, we cannot dedicate—we cannot consecrate—we cannot hallow, this ground—the brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. the world will little note, nor long remember what (1m54s)
INFO:     192.168.1.73:33170 - "GET /v1/models HTTP/1.1" 200 OK

llama_print_timings:        load time = 18498.45 ms
llama_print_timings:      sample time =    40.00 ms /    64 runs   (    0.63 ms per run)
llama_print_timings: prompt eval time = 20896.10 ms /   121 tokens (  172.70 ms per token)
llama_print_timings:        eval time = 90135.39 ms /    63 runs   ( 1430.72 ms per run)
llama_print_timings:       total time = 114295.56 ms

Calling llama.cpp directly with the same prompt and roughly the same args:

./main --temp 0.7 -t 16 -m /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin -b 512 -c 512 -p "Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do." -n 64
[..]
llama_print_timings:        load time = 23744.10 ms
llama_print_timings:      sample time =    46.10 ms /    64 runs   (    0.72 ms per run)
llama_print_timings: prompt eval time = 21753.05 ms /   121 tokens (  179.78 ms per token)
llama_print_timings:        eval time = 87034.24 ms /    63 runs   ( 1381.50 ms per run)
llama_print_timings:       total time = 110825.60 ms

Caveat: I'm busily training an alpaca-lora on another GPU, so this is not an apples-to-apples comparison in terms of background system load.
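
(As a rough sanity check, the eval timings above work out to well under one token per second in both cases; a quick calculation from the printed numbers:)

# Tokens per second from the llama_print_timings eval lines above.
python_eval_ms, python_runs = 90135.39, 63   # llama-cpp-python via the server REST API
native_eval_ms, native_runs = 87034.24, 63   # llama.cpp ./main
print(round(python_runs / (python_eval_ms / 1000), 2))  # ~0.7 tokens/s
print(round(native_runs / (native_eval_ms / 1000), 2))  # ~0.72 tokens/s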

@xdevfaheem
Author


So you're saying llama.cpp and its Python binding take almost the same time, right?

Are the 16 cores you have the reason for this inference time?

@gjmulder
Contributor

gjmulder commented May 5, 2023

So you're saying llama.cpp and its Python binding take almost the same time, right?

EDIT: 6 milliseconds overhead.

Are the 16 cores you have the reason for this inference time?

Yes, and compiling against CuBLAS for my 1080Ti GPU.

@xdevfaheem
Author

What should I do to decrease the inference time?

@gjmulder
Contributor

gjmulder commented May 6, 2023

In approx. order:

  1. 8+ CPUs
  2. Faster memory
  3. GPU

The following GSheet I'm building has a range of models and hardware. You're primarily concerned with the ETA column:

llama.cpp perplexities for the first 406 lines of wiki.test.raw

For info on the benchmark I am using, see: Perplexity (Quality of Generation) Scores #406

@raymerjacque

Did something change in the code? My script used to process multiple inputs at the same time; now I see it is queuing them and not processing one until the previous one is done.

@raymerjacque

I found a nice solution to fix the problems, thank you for the advice. I managed to get my search queries down to under 10 seconds. I implemented a dual web search feature, and now the script uses a mixture of web search and llama queries to return results. It takes a load off the server while also giving users access to current events and live results, while still getting replies from the model.

@xdevfaheem
Author

xdevfaheem commented May 6, 2023


Sounds interesting. Can you share your codebase?

@getorca

getorca commented May 8, 2023

I just want to add some very arbitrary benchmarks; it seems like clock speed matters more than cores. E.g.:

Both running TheBloke/gpt4-x-vicuna-13B-GGML:

On my dual-CPU 40-core machine (2 x Intel E7-4850 @ 2.00GHz) with 160GB of RAM I get 0.15 t/s,
vs
on my 8-core (Ryzen 7 1700 @ 3.0GHz) consumer machine with 64GB of RAM I get 1.5 t/s,

so 10x the inference speed.

@gjmulder
Contributor

gjmulder commented May 8, 2023

@getorca 10X is a lot!

You have 80GHz of aggregate clock on the Intel and 24GHz on the AMD, so a factor of 3X. How many channels does the memory on the Intel have, as compared to the AMD?

sudo dmidecode -t 17

@getorca

getorca commented May 8, 2023

@gjmulder Yes, it had me somewhat surprised, since the server CPUs should be able to do ~3.3x the number of calculations per second.

The consumer build has 2 channels of memory (2 x 32GB sticks) with a speed of 2667 MT/s, DDR4.

The server is an older Dell PowerEdge R810. It has 4 channels per CPU, 8 x 16GB sticks plus 4 x 8GB (sorry, 160GB total RAM), but the speed is lower, 1066 MT/s, and it's DDR3 SDRAM.

I'm not the most knowledgeable about hardware, but I guess there could be lots of other bottlenecks... maybe because the channels are shared between multiple sticks of RAM?
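
(A rough back-of-the-envelope comparison of peak memory bandwidth, assuming the usual 64-bit (8-byte) bus per channel and counting only the channels attached to the one socket doing the work; this ignores NUMA effects and real-world efficiency:)

# Peak theoretical bandwidth per socket = channels * MT/s * 8 bytes/transfer, in GB/s.
consumer_gb_s = 2 * 2667 * 8 / 1000   # Ryzen 7 1700, dual-channel DDR4-2667 -> ~42.7 GB/s
server_gb_s   = 4 * 1066 * 8 / 1000   # E7-4850, quad-channel DDR3-1066      -> ~34.1 GB/s
print(consumer_gb_s, server_gb_s)

So despite having more channels, the old DDR3 server tops out below the consumer DDR4 build, which fits with token generation being largely memory-bandwidth bound.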

@AlexiaChen

python
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> print(multiprocessing.cpu_count())
24
>>> quit()

24 cores

Mine is also dead slow without cuBLAS. Why? I set max_tokens to 1024.

@gjmulder
Contributor

See #232 for a deep dive into CuBLAS performance.
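
(One quick way to check whether the installed build has BLAS enabled at all is the system-info string from llama.h; a minimal sketch, assuming the low-level bindings expose llama_print_system_info() as they do for the rest of llama.h:)

import llama_cpp

# Prints feature flags such as "BLAS = 1" when the library was built with cuBLAS/OpenBLAS.
print(llama_cpp.llama_print_system_info().decode())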

@gjmulder reopened this May 23, 2023
@gjmulder added the duplicate label May 23, 2023
@gjmulder closed this as not planned May 23, 2023