Inferencing is Dead Slow #155
@abetlen Please take a look at this!
How many cores do you have?

import multiprocessing
print(multiprocessing.cpu_count())
2
Get yourself a machine with at least 8 cores, otherwise it's going to be dead slow. If you want it to be a bit faster, set n_threads to 2, since the default behaviour is to use half the number of cores to account for hyperthreading.
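For example, a minimal sketch of pinning the thread count when loading the model through llama-cpp-python directly (the model path here is just a placeholder; n_threads is the relevant knob):

from llama_cpp import Llama

# Placeholder model path -- substitute your own GGML file.
llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",
    n_threads=2,  # on a 2-core machine, use both cores instead of the default of half the core count
)

output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])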
Thanks for your response... I think I should move to GPTQ instead of GGML.
How about modifying the function calls?
@TheFaheem Yes, I think someone else tried to run this in a Colab notebook the other day and faced the same issue. I can look into it as well, but it's likely just because of the low core count of the default machines there.
Why so many function calls? Why can't it be simplified to just _call(), create_completion(), generate(), eval(), and llama_eval()?
@TheFaheem Which functions are you referring to? The functions inside of
in llama.py |
I am using the alpaca-lora-65B-GGML/alpaca-lora-65B.ggml.q5_1.bin model on a 16 (physical) core AMD server with an old 1080Ti GPU:
Calling
Caveat: I'm busily training an alpaca-lora on another GPU, so it's not an apples-to-apples comparison in terms of background system load.
So you're saying llama.cpp and its Python binding take almost the same time, right? Are the 16 cores you have the reason for this inference time?
EDIT: 6 milliseconds overhead.
Yes, and compiling against CuBLAS for my 1080Ti GPU.
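For reference, a rough sketch of how the per-call overhead can be measured from the Python side (the model path is a placeholder; the ~6 ms figure above came from a comparison against plain llama.cpp):

import time
from llama_cpp import Llama

# Placeholder path -- substitute the GGML model you are benchmarking.
llm = Llama(model_path="./models/ggml-model-q4_0.bin")

start = time.perf_counter()
out = llm("Q: What is the capital of France? A:", max_tokens=32)
elapsed = time.perf_counter() - start

n_tokens = max(out["usage"]["completion_tokens"], 1)
print(f"{elapsed:.2f}s total, {elapsed / n_tokens * 1000:.1f} ms per generated token")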
What should I do to decrease the inference time?
In approx. order:
The following GSheet I'm building has a range of models and hardware. You're primarily concerned with the ETA column: llama.cpp perplexities for the first 406 lines of wiki.test.raw. For info on the benchmark I am using, see Perplexity (Quality of Generation) Scores #406.
Did something change in the code? My script used to process multiple inputs at the same time; now I see it is queuing them and not processing one until the previous one is done.
I found a nice solution to fix the problems, thank you for the advice. I managed to get my search queries down to under 10 seconds. I implemented a dual web search feature, and now the script uses a mixture of web search and llama queries to return results. It takes a load off the server while also giving the users access to current events and live results while still getting replies from the model.
Sounds interesting. Can you share your codebase?
I just want to add some very arbitrary benchmarks; it seems like clock speed matters more than core count. E.g., both running the same workload: on my dual-CPU 40-core machine (2 x Intel E7-4850 @ 2.00GHz) with 160GB of RAM I get 0.15 t/s, so about a 10x difference in inference speed.
@getorca 10X is a lot! You have 80GHz on the Intel and 24GHz on your AMD, so a factor of 3X. How many memory channels does the Intel have, compared to the AMD?
@gjmulder Yes, it had me somewhat surprised, since the server CPUs should be able to do ~3.3x the number of calculations per second. The consumer build has 2 channels of memory (2x 32GB sticks) at 2667 MT/s, DDR4. The server is an older Dell PowerEdge R810. It has 4 channels per CPU, 8x 16GB sticks and 4x 8GB (so 160GB total RAM), but the speed is lower, 1066 MT/s, and it's DDR3 SDRAM. I'm not the most knowledgeable with hardware, but I guess there could be lots of other bottlenecks... maybe because the channels are shared across multiple sticks of RAM?
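As a rough back-of-the-envelope check (assuming standard 64-bit DDR channels and ignoring NUMA, latency, and rank effects), the peak memory bandwidth of the two machines works out to:

# Peak bandwidth = channels * transfer rate (MT/s) * 8 bytes per 64-bit transfer.
# Real-world throughput will be noticeably lower.
def peak_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000

consumer = peak_bandwidth_gb_s(channels=2, mt_per_s=2667)  # DDR4 desktop build
server = peak_bandwidth_gb_s(channels=4, mt_per_s=1066)    # DDR3, per socket on the R810

print(f"consumer: ~{consumer:.0f} GB/s, server (per socket): ~{server:.0f} GB/s")
# ~43 GB/s vs ~34 GB/s: the DDR3 server has less bandwidth per socket despite having more channels.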
python
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>> print(multiprocessing.cpu_count())
24
>>> quit()

24 cores. I am also dead slow without cuBLAS. Why? I set max_tokens to 1024.
See #232 for a deep dive into CuBLAS performance.
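If you're unsure whether your llama-cpp-python build actually has BLAS enabled, the low-level llama_print_system_info binding should report it (treat the exact output format as an assumption; it mirrors ggml's system info string):

import llama_cpp

# Prints something like "... BLAS = 1 ..." when the library was built with cuBLAS/OpenBLAS.
info = llama_cpp.llama_print_system_info()
print(info.decode("utf-8") if isinstance(info, bytes) else info)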
I'm using LangChain's llama-cpp integration to run an LLM. It's so slow that each token takes about 10-20 seconds to generate.
My code is simple:
from langchain.llms import LlamaCpp
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="TheBloke/stable-vicuna-13B-GGML", filename="stable-vicuna-13B.ggml.q4_2.bin")
llm = LlamaCpp(
model_path= model_path,
n_ctx = 512,
n_parts = -1,
seed = 1337,
f16_kv = True,
logits_all = False,
vocab_only = False,
use_mmap = True,
use_mlock = False,
n_threads = None,
n_batch = 512,
temperature=1.0,
max_tokens=256,
top_p=0.90,
top_k=40,
streaming=True,
last_n_tokens_size = 64,
)
llm("Write a Python Code for Scraping The Given website")
I read LangChain's llama-cpp integration source code and this repo's code to get a clear understanding of what was going on.
After the last line is executed,
it calls __call__(),
which calls generate(),
which calls _generate(),
which calls _call(),
which then calls the stream() function,
which calls the _create_completion() function,
which again calls generate(),
which calls the sampling helper functions,
after which it calls eval(),
finishing with llama_eval() and _lib.llama_eval().
Is there any way to decrease the latency of the model (or increase the inference speed) with the above in mind?
Or can we shrink the above function calls into one single pipeline?
Sorry if there are any typos, I'm in a rush!
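One way to cut out several of those wrapper layers is to skip LangChain and call llama-cpp-python directly; a minimal streaming sketch under that assumption, using the same model as above:

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(repo_id="TheBloke/stable-vicuna-13B-GGML",
                             filename="stable-vicuna-13B.ggml.q4_2.bin")

# __call__ goes straight to create_completion, skipping the LangChain wrappers.
llm = Llama(model_path=model_path, n_ctx=512)

for chunk in llm("Write Python code for scraping the given website.",
                 max_tokens=256, temperature=1.0, top_p=0.90, top_k=40, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)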
See the bottom of the Colab screenshot:
