
Inferencing is Dead Slow #155


Closed
xdevfaheem opened this issue May 5, 2023 · 23 comments
Labels
duplicate, performance

Comments

@xdevfaheem

I'm using LangChain's llama-cpp integration to run an LLM. It's so slow that each token takes about 10-20 seconds to generate.

My code is simple:

from huggingface_hub import hf_hub_download
from langchain.llms import LlamaCpp

model_path = hf_hub_download(repo_id="TheBloke/stable-vicuna-13B-GGML", filename="stable-vicuna-13B.ggml.q4_2.bin")
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=512,
    n_parts=-1,
    seed=1337,
    f16_kv=True,
    logits_all=False,
    vocab_only=False,
    use_mmap=True,
    use_mlock=False,
    n_threads=None,
    n_batch=512,
    temperature=1.0,
    max_tokens=256,
    top_p=0.90,
    top_k=40,
    streaming=True,
    last_n_tokens_size=64,
)
llm("Write a Python Code for Scraping The Given website")

I went through LangChain's llama-cpp integration source code and this repo's code to get a clear understanding of what was going on.

After the last line is executed:
it calls call()
which calls generate()
which calls _generate()
which calls _call()
which then calls the stream() function
which calls _create_completion()
which again calls generate()
which calls the sampling helper function
after which it calls eval()
finishing with llama_eval() and _lib.llama_eval()

Is there any way to decrease the latency of the model or increase the inference speed, keeping the above in mind? Or can we shrink the above function calls into one single pipeline?

Sorry if there are any typos, I'm in a rush!

See the bottom of the Colab screenshot:
[Screenshot from 2023-05-05 12-06-29]
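
(Aside: the wrapper layers described above can be bypassed by calling llama-cpp-python's Llama class directly; a minimal sketch, reusing the model_path from the snippet above and assuming 2 threads:)

from llama_cpp import Llama

# Load the GGML model directly, without the LangChain wrapper.
llm = Llama(model_path=model_path, n_ctx=512, n_threads=2)

# Stream the completion token by token.
for chunk in llm("Write Python code for scraping the given website",
                 max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)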

@xdevfaheem changed the title from "Inferencing is So Slow" to "Inferencing is Dead Slow" May 5, 2023
@xdevfaheem
Author

@abetlen Please take a look!

@SagsMug
Contributor

SagsMug commented May 5, 2023


How many cores do you have?
You can check with:

import multiprocessing
print(multiprocessing.cpu_count())

@xdevfaheem
Author


How many cores do you have? You can check with:

import multiprocessing
print(multiprocessing.cpu_count())

2

@SagsMug
Contributor

SagsMug commented May 5, 2023


Get yourself a machine with at least 8 cores, otherwise it's going to be dead slow.

If you want it to be a bit faster, set n_threads to 2, as the default behaviour is to use half the number of cores to account for hyperthreading.
Meaning you are currently only using one core.
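
(For example, a minimal sketch of the LangChain wrapper with the thread count set explicitly; other parameters as in the original snippet:)

from langchain.llms import LlamaCpp

# Use both physical cores instead of the cpu_count // 2 default (1 thread here).
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=512,
    n_threads=2,
    n_batch=512,
    streaming=True,
)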

@xdevfaheem
Author


Thanks for your response...

I think I should move to GPTQ instead of GGML.

@xdevfaheem
Author


How about modifying the function calls?

@abetlen
Owner

abetlen commented May 5, 2023

@TheFaheem yes I think someone else tried to run this in a colab notebook the other day and faced the same issue. I can look into it as well but it's likely just because of the low core count of the default machines there.

@xdevfaheem
Author


Why so many function calls? Why can't it be simplified to just _call(), create_completion(), generate(), eval() and llama_eval()?

@abetlen
Owner

abetlen commented May 5, 2023

@TheFaheem which functions are you referring to? The functions inside of llama_cpp.py are a mirror of llama.h in the llama.cpp project; exposing the full API is an explicit goal of this project.

@xdevfaheem
Author


in llama.py

@gjmulder
Contributor

gjmulder commented May 5, 2023

I am using the alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin model on a 16 (physical) core AMD server with an old 1080Ti GPU:

llama-cpp-python timings with an end-to-end (i.e. worst case) call time of 114 seconds as measured by a call to the server REST API:

2023-05-05 17:48:21,431 - INFO - Temp: 0.7, prompt: four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. we are met on a great battle field of that war. we come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. this we may, in all propriety do.
2023-05-05 17:50:15,732 - INFO - API response: but, in a larger sense, we cannot dedicate—we cannot consecrate—we cannot hallow, this ground—the brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. the world will little note, nor long remember what (1m54s)
INFO:     192.168.1.73:33170 - "GET /v1/models HTTP/1.1" 200 OK

llama_print_timings:        load time = 18498.45 ms
llama_print_timings:      sample time =    40.00 ms /    64 runs   (    0.63 ms per run)
llama_print_timings: prompt eval time = 20896.10 ms /   121 tokens (  172.70 ms per token)
llama_print_timings:        eval time = 90135.39 ms /    63 runs   ( 1430.72 ms per run)
llama_print_timings:       total time = 114295.56 ms

Calling llama.cpp directly with the same prompt and roughly the same args:

./main --temp 0.7 -t 16 -m /data/llama/alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin -b 512 -c 512 -p "Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do." -n 64
[..]
llama_print_timings:        load time = 23744.10 ms
llama_print_timings:      sample time =    46.10 ms /    64 runs   (    0.72 ms per run)
llama_print_timings: prompt eval time = 21753.05 ms /   121 tokens (  179.78 ms per token)
llama_print_timings:        eval time = 87034.24 ms /    63 runs   ( 1381.50 ms per run)
llama_print_timings:       total time = 110825.60 ms

Caveat: I'm busily training an alpaca-lora on another GPU, so this is not an apples-to-apples comparison in terms of background system load.
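
(As a rough sanity check, the eval timings above work out to well under one token per second in both cases; a quick calculation from the printed numbers:)

# Tokens per second from the llama_print_timings eval lines above.
python_eval_ms, python_runs = 90135.39, 63   # llama-cpp-python via the server REST API
native_eval_ms, native_runs = 87034.24, 63   # llama.cpp ./main
print(round(python_runs / (python_eval_ms / 1000), 2))  # ~0.7 tokens/s
print(round(native_runs / (native_eval_ms / 1000), 2))  # ~0.72 tokens/s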

@xdevfaheem
Author


So you're saying llama.cpp and its Python binding take almost the same time, right?

Are the 16 cores you have the reason for this inference time?

@gjmulder
Contributor

gjmulder commented May 5, 2023

So you're saying llama.cpp and its Python binding take almost the same time, right?

EDIT: 6 milliseconds overhead.

Are the 16 cores you have the reason for this inference time?

Yes, and compiling against CuBLAS for my 1080Ti GPU.

@xdevfaheem
Author

What should I do to decrease the inference time?

@gjmulder
Contributor

gjmulder commented May 6, 2023

In approx. order:

  1. 8+ CPUs
  2. Faster memory
  3. GPU

The following GSheet I'm building has a range of models and hardware. You're primarily concerned with the ETA column:

llama.cpp perplexities for the first 406 lines of wiki.test.raw

For info on the benchmark I am using, see: Perplexity (Quality of Generation) Scores #406

@raymerjacque

Did something change in the code? My script used to process multiple inputs at the same time; now I see it is queuing them and not processing one until the previous one is done.

@raymerjacque

I found a nice solution to fix the problems, thank you for the advice. I managed to get my search queries down to under 10 seconds. I implemented a dual web search feature, and now the script uses a mixture of web search and llama queries to return results. It takes a load off the server while also giving users access to current events and live results, while still getting replies from the model.

@xdevfaheem
Author

xdevfaheem commented May 6, 2023


Sounds interesting. Can you share your codebase?

@getorca

getorca commented May 8, 2023

I just want to add some very arbitrary benchmarks; it seems like clock speed matters more than cores. E.g.:

Both running TheBloke/gpt4-x-vicuna-13B-GGML:

On my dual-CPU 40-core machine (2 x Intel E7-4850 @ 2.00GHz) with 160GB of RAM I get 0.15 t/s,
vs
on my 8-core (Ryzen 7 1700 @ 3.0GHz) consumer machine with 64GB of RAM I get 1.5 t/s,

so 10x the inference speed.

@gjmulder
Contributor

gjmulder commented May 8, 2023

@getorca 10X is a lot!

You have 80GHz of aggregate clock on the Intel and 24GHz on the AMD, so a factor of 3X. How many channels does the memory on the Intel have, as compared to the AMD?

sudo dmidecode -t 17

@getorca

getorca commented May 8, 2023

@gjmulder Yes, it had me somewhat surprised, since the server CPUs should be able to do ~3.3x the number of calculations per second.

The consumer build has 2 channels of memory (2 x 32GB sticks) with a speed of 2667 MT/s, DDR4.

The server is an older Dell PowerEdge R810. It has 4 channels per CPU, 8 x 16GB sticks plus 4 x 8GB (sorry, 160GB total RAM), but the speed is lower, 1066 MT/s, and it's DDR3 SDRAM.

I'm not the most knowledgeable about hardware, but I guess there could be lots of other bottlenecks... maybe because the channels are shared between multiple sticks of RAM?
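
(A rough back-of-the-envelope comparison of peak memory bandwidth, assuming the usual 64-bit (8-byte) bus per channel and counting only the channels attached to the one socket doing the work; this ignores NUMA effects and real-world efficiency:)

# Peak theoretical bandwidth per socket = channels * MT/s * 8 bytes/transfer, in GB/s.
consumer_gb_s = 2 * 2667 * 8 / 1000   # Ryzen 7 1700, dual-channel DDR4-2667 -> ~42.7 GB/s
server_gb_s   = 4 * 1066 * 8 / 1000   # E7-4850, quad-channel DDR3-1066      -> ~34.1 GB/s
print(consumer_gb_s, server_gb_s)

So despite having more channels, the old DDR3 server tops out below the consumer DDR4 build, which fits with token generation being largely memory-bandwidth bound.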

@AlexiaChen

python
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> print(multiprocessing.cpu_count())
24
>>> quit()

24 cores

Mine is also dead slow without cuBLAS. Why? I set max_tokens to 1024.

@gjmulder
Contributor

See #232 for a deep dive into CuBLAS performance.
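
(One quick way to check whether the installed build has BLAS enabled at all is the system-info string from llama.h; a minimal sketch, assuming the low-level bindings expose llama_print_system_info() as they do for the rest of llama.h:)

import llama_cpp

# Prints feature flags such as "BLAS = 1" when the library was built with cuBLAS/OpenBLAS.
print(llama_cpp.llama_print_system_info().decode())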

@gjmulder reopened this May 23, 2023
@gjmulder added the duplicate label May 23, 2023
@gjmulder closed this as not planned May 23, 2023