Running on an A100 node #3359
Replies: 2 comments 7 replies
Are the GPUs interconnected using NVLink or PCIe? Is it possible to rebuild with …
@ggerganov How did you get "143.43 tokens per second" with CUDA_VISIBLE_DEVICES=0? Can you share your command, model and settings? I only get "109.17 tokens per second". Thanks.

CUDA_VISIBLE_DEVICES=1 ./main -m models/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf -i --interactive-first -ngl 40 -n 50
Log start
llama_print_timings: load time = 2196.50 ms
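Not from the thread itself, but for comparing numbers like these the llama.cpp tree of that era already shipped a llama-bench tool, which avoids interactive-mode variance; a hedged sketch (model path and -p/-n values are illustrative assumptions):

```shell
# Hypothetical benchmark invocation; -p and -n set prompt and
# generation lengths, -ngl the number of layers offloaded to the GPU.
CUDA_VISIBLE_DEVICES=0 ./llama-bench \
  -m llama-2-7b-chat.Q4_K_M.gguf \
  -ngl 99 -p 512 -n 128
```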
[OUTDATED]
I currently have access to a node with 8x A100 and am doing some experiments, so I decided to share some of the results.

Slow without CUDA_VISIBLE_DEVICES=0

Not sure why, but if I run main without setting the environment variable CUDA_VISIBLE_DEVICES=0, the performance is ~8 times worse compared to when setting it. Any ideas what is causing this?
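A minimal, runnable sketch of the device-pinning mechanics involved here (run_main is a hypothetical stand-in for ./main so the example works off-node; the actual slowdown would of course need the real binary on a CUDA node):

```shell
# CUDA_VISIBLE_DEVICES controls which GPUs the CUDA runtime exposes to a
# process. run_main is a stand-in for ./main: it only reports which
# devices the process would see.
unset CUDA_VISIBLE_DEVICES

run_main() {
  echo "devices=${CUDA_VISIBLE_DEVICES:-all}"
}

( export CUDA_VISIBLE_DEVICES=0; run_main )  # pinned: only GPU 0 visible
run_main                                     # default: all 8 GPUs visible
```

With all 8 GPUs visible, llama.cpp splits the model across devices, so a small model can end up paying cross-GPU synchronization costs that a single-GPU run avoids.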
Performance benchmarks

- LLAMA_CUDA_MMV_Y=2 seems to slightly improve the performance
- LLAMA_CUDA_DMMV_X=64 also slightly improves the performance
- -mmq 0 (-nommq) significantly improves prefill speed
- CMAKE_CUDA_ARCHITECTURES=native
[benchmark tables not preserved] build: 39ddda2 (1301), build: 39ddda2 (1301), build: 48edda3 (1330)
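Pulling the knobs above into one place, a hedged build-and-run sketch (the cmake option names LLAMA_CUBLAS, LLAMA_CUDA_MMV_Y and LLAMA_CUDA_DMMV_X are as they existed in llama.cpp around these builds; verify against your checkout, since they were later renamed):

```shell
# Build with CUDA enabled, native GPU arch, and the tuned kernel parameters:
cmake -B build \
  -DLLAMA_CUBLAS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=native \
  -DLLAMA_CUDA_MMV_Y=2 \
  -DLLAMA_CUDA_DMMV_X=64
cmake --build build --config Release

# At run time, -nommq (equivalently -mmq 0) disables the custom quantized
# matmul kernels in favor of cuBLAS, which helped prefill speed here:
CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m model.gguf -ngl 99 -nommq
```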
For reference, here is the same test on M2 Ultra
[benchmark tables not preserved] build: 99115f3 (1273), build: 99115f3 (1273)
real 3m2.119s
user 0m8.147s
sys 0m8.614s